Abstract 



Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural lan- 
guage expressions that convey almost the same information. Textual entailment methods, on the 
other hand, recognize, generate, or extract pairs of natural language expressions, such that a human 
who reads (and trusts) the first element of a pair would most likely infer that the other element is 
also true. Paraphrasing can be seen as bidirectional textual entailment and methods from the two 
areas are often similar. Both kinds of methods are useful, at least in principle, in a wide range of 
natural language processing applications, including question answering, summarization, text gener- 
ation, and machine translation. We summarize key ideas from the two areas by considering in turn 
recognition, generation, and extraction methods, also pointing to prominent articles and resources. 

1. Introduction 

This article is a survey of computational methods for paraphrasing and textual entailment. Para- 
phrasing methods recognize, generate, or extract (e.g., from corpora) paraphrases, meaning phrases, 
sentences, or longer texts that convey the same, or almost the same information. For example, (1) 
and (2) are paraphrases. Most people would also accept (3) as a paraphrase of (1) and (2), though 
it could be argued that in (3) the construction of the bridge has not necessarily been completed, 
unlike (1) and (2). 1 Such fine distinctions, however, are usually ignored in paraphrasing and textual 
entailment work, which is why we say that paraphrases may convey almost the same information. 

(1) Wonderworks Ltd. constructed the new bridge. 

(2) The new bridge was constructed by Wonderworks Ltd. 

(3) Wonderworks Ltd. is the constructor of the new bridge. 

Paraphrasing methods may also operate on templates of natural language expressions, like (4)- 
(6); here the slots X and Y can be filled in with arbitrary noun phrases. Templates specified at 
the syntactic or semantic level may also be used, where the slot fillers may be required to have 
particular syntactic relations (e.g., verb-object) to other words or constituents, or to satisfy semantic 
constraints (e.g., requiring Y to denote a book). 

(4) X wrote Y. 

(5) Y was written by X. 

(6) X is the writer of Y. 
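
As a minimal sketch of how surface-level templates such as (4)-(6) might be applied, the following Python fragment matches a sentence against one template and instantiates the other templates with the same slot fillers. The names `TEMPLATES`, `template_to_regex`, and `paraphrase` are hypothetical, and the slots are matched as arbitrary strings rather than as noun phrases, so this is an illustration under simplifying assumptions, not a method taken from the literature surveyed here.

```python
import re

# Templates (4)-(6); X and Y are slots. The list itself is an illustrative assumption.
TEMPLATES = [
    "X wrote Y.",
    "Y was written by X.",
    "X is the writer of Y.",
]

SLOT = re.compile(r"\b([XY])\b")


def template_to_regex(template):
    """Compile a template into a regex with one named group per slot."""
    parts, pos = [], 0
    for m in SLOT.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))  # literal text
        parts.append(f"(?P<{m.group(1)}>.+?)")            # slot filler
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("^" + "".join(parts) + "$")


def paraphrase(sentence):
    """Return the other templates instantiated with the fillers found in `sentence`."""
    for source in TEMPLATES:
        match = template_to_regex(source).match(sentence)
        if match:
            fillers = match.groupdict()
            return [
                SLOT.sub(lambda m: fillers[m.group(1)], target)
                for target in TEMPLATES
                if target != source
            ]
    return []  # no template matched


print(paraphrase("Tolstoy wrote War and Peace."))
```

A realistic system would restrict the slot fillers syntactically or semantically, as discussed above, instead of accepting any string.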



1 Readers familiar with tense and aspect theories will have recognized that (1)-(3) involve an "accomplishment" of 
Vendler's (1967) taxonomy. The accomplishment's completion point is not necessarily reached in (3), unlike (1)-(2). 

Textual entailment methods, on the other hand, recognize, generate, or extract pairs (T,H) of 
natural language expressions, such that a human who reads (and trusts) T would infer that H is most 
likely also true (Dagan, Glickman, & Magnini, 2006). For example, (7) textually entails (8), but (9) 
does not textually entail (10). 2 

(7) The drugs that slow down Alzheimer's disease work best the earlier you administer them. 

(8) Alzheimer's disease can be slowed down using drugs. 

(9) Drew Walker, Tayside's public health director, said: "It is important to stress that this is not a confirmed 
case of rabies." 

(10) A case of rabies was confirmed. 

As in paraphrasing, textual entailment methods may operate on templates. For example, in a 
discourse about painters, composers, and their work, (11) textually entails (12), for any noun phrases 
X and Y. However, (12) does not textually entail (11), when Y denotes a symphony composed by 
X. If we require textual entailment between templates to hold for all possible slot fillers, then (11) 
textually entails (12) in our example's discourse, but the reverse does not hold. 

(11) X painted Y. 

(12) Y is the work of X. 

In general, we cannot judge if two natural language expressions are paraphrases or a correct 
textual entailment pair without selecting particular readings of the expressions, among those that 
may be possible due to multiple word senses, syntactic ambiguities etc. For example, (13) textually 
entails (14) with the financial sense of "bank", but not when (13) refers to the bank of a river. 

(13) A bomb exploded near the French bank. 

(14) A bomb exploded near a building. 

One possibility, then, is to examine the language expressions (or templates) only in particular 
contexts that make their intended readings clear. Alternatively, we may want to treat as correct any 
textual entailment pair (T,H) for which there are possible readings of T and H, such that a human 
who reads T would infer that H is most likely also true; then, if a system reports that (13) textually 
entails (14), its response is to be counted as correct, regardless of the intended sense of "bank". 
Similarly, paraphrases would have possible readings conveying almost the same information. 

The lexical substitution task of SEMEVAL (McCarthy & Navigli, 2009), where systems are re- 
quired to find an appropriate substitute for a particular word in the context of a given sentence, 
can be seen as a special case of paraphrasing or textual entailment, restricted to pairs of words. 
SEMEVAL's task, however, includes the requirement that it must be possible to use the two words 
(original and replacement) in exactly the same context. In a similar manner, one could adopt a 
stricter definition of paraphrases, which would require them not only to have the same (or almost 
the same) meaning, but also to be expressions that can be used interchangeably in grammatical sen- 
tences. In that case, although (15) and (16) are paraphrases, their underlined parts are not, because 
they cannot be swapped in the two sentences; the resulting sentences would be ungrammatical. 

(15) Edison invented the light bulb in 1879, providing a long lasting source of light. 

(16) Edison's invention of the light bulb in 1879 provided a long lasting source of light. 

A similar stricter definition of textual entailment would impose the additional requirement that H 
and T can replace each other in grammatical sentences. 

2 Simplified examples from RTE-2 (Bar-Haim, Dagan, Dolan, Ferro, Giampiccolo, Magnini, & Szpektor, 2006). 

1.1 Possible Applications of Paraphrasing and Textual Entailment Methods 

The natural language expressions that paraphrasing and textual entailment methods consider are 
not always statements. In fact, many of these methods were developed having question answering 
(QA) systems in mind. In QA systems for document collections (Voorhees, 2001; Pasca, 2003; 
Harabagiu & Moldovan, 2003; Molla & Vicedo, 2007), a question may be phrased differently than 
in a document that contains the answer, and taking such variations into account can improve system 
performance significantly (Harabagiu, Maiorano, & Pasca, 2003; Duboue & Chu-Carroll, 2006; 
Harabagiu & Hickl, 2006; Riezler, Vasserman, Tsochantaridis, Mittal, & Liu, 2007). For example, 
a QA system may retrieve relevant documents or passages, using the input question as a query to an 
information retrieval or Web search engine (Baeza-Yates & Ribeiro-Neto, 1999; Manning, 2008), 
and then check if any of the retrieved texts textually entails a candidate answer (Moldovan & Rus, 
2001; Duclaye, Yvon, & Collin, 2003). 3 If the input question is (17) and the search engine returns 
passage (18), the system may check if (18) textually entails any of the candidate answers of (19), 
where we have replaced the interrogative "who" of (17) with all the expressions of (18) that a named 
entity recognizer (Bikel, Schwartz, & Weischedel, 1999; Sekine & Ranchhod, 2009) would ideally 
have recognized as person names. 4 

(17) Who sculpted the Doryphoros? 

(18) The Doryphoros is one of the best known Greek sculptures of the classical era in Western Art. The 
Greek sculptor Polykleitos designed this work as an example of the "canon" or "rule", showing the 
perfectly harmonious and balanced proportions of the human body in the sculpted form. The sculpture 
was known through the Roman marble replica found in Herculaneum and conserved in the Naples 
National Archaeological Museum, but, according to Francis Haskell and Nicholas Penny, early con- 
noisseurs passed it by in the royal Bourbon collection at Naples without notable comment. 

(19) Polykleitos/Francis Haskell/Nicholas Penny sculpted the Doryphoros. 
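
The construction of candidate answers like (19) can be sketched as follows. The function name `candidate_answers` is hypothetical, the list of person names stands in for the output of a named entity recognizer, and only questions of the form "Who ...?" are handled; each candidate would then be passed, together with the passage, to a textual entailment recognizer, which is not shown.

```python
def candidate_answers(question, person_names):
    """Replace the interrogative 'Who' of a question with each candidate name."""
    prefix = "Who "
    if not (question.startswith(prefix) and question.endswith("?")):
        raise ValueError("this sketch only handles questions of the form 'Who ...?'")
    body = question[len(prefix):-1]  # e.g. 'sculpted the Doryphoros'
    return [f"{name} {body}." for name in person_names]


# Assumed (ideal) output of a named entity recognizer applied to passage (18).
names = ["Polykleitos", "Francis Haskell", "Nicholas Penny"]

for hypothesis in candidate_answers("Who sculpted the Doryphoros?", names):
    print(hypothesis)
```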

The input question may also be paraphrased, to allow more, potentially relevant passages to be 
obtained. Question paraphrasing is also useful when mapping user questions to lists of frequently 
asked questions (FAQs) that are accompanied by their answers (Tomuro, 2003); and natural language 
interfaces to databases often generate question paraphrases to allow users to understand if their 
queries have been understood (McKeown, 1983; Androutsopoulos, Ritchie, & Thanisch, 1995). 

Paraphrasing and textual entailment methods are also useful in several other natural language 
processing applications. In text summarization (Mani, 2001; Hovy, 2003), for example, an impor- 
tant processing stage is typically sentence extraction, which identifies the most important sentences 
of the texts to be summarized. During that stage, especially when generating a single summary from 
several documents (Barzilay & McKeown, 2005), it is important to avoid selecting sentences (e.g., 
from different news articles about the same event) that convey the same information (paraphrases) 
as other sentences that have already been selected, or sentences whose information follows from 
other already selected sentences (textual entailment). 

Sentence compression (Knight & Marcu, 2002; McDonald, 2006; Cohn & Lapata, 2008; Clarke 
& Lapata, 2008; Cohn & Lapata, 2009; Galanis & Androutsopoulos, 2010), often also a processing 
stage of text summarization, can be seen as a special case of sentence paraphrasing, as suggested by 

3 Culicover (1968) discussed different types of paraphrasing and entailment, and proposed the earliest computational 
treatment of paraphrasing and textual entailment that we are aware of, with the goal of retrieving passages of texts that 
answer natural language queries. We thank one of the anonymous reviewers for pointing us to Culicover's work. 

4 Passage (18) is based on Wikipedia's page for Doryphoros. 

Zhao et al. (2009), with the additional constraint that the resulting sentence must be shorter than the 
original one and still grammatical; for example, a sentence matching (5) or (6) could be shortened by 
converting it to a paraphrase of the form of (4). Most sentence compression work, however, allows 
less important information of the original sentence to be discarded. Hence, the resulting sentence is 
entailed by the original one, but it is not necessarily a paraphrase of it. In the following example, (21) is 
a compressed form of (20) produced by a human. 5 

(20) Mother Catherine, 82, the mother superior, will attend the hearing on Friday, he said. 

(21) Mother Catherine, 82, the mother superior, will attend. 

When the compressed sentence is not necessarily a paraphrase of the original one, we may first 
produce (grammatical) candidate compressions that are textually entailed by the original sentence; 
hence, a mechanism to generate textually entailed sentences is useful. Additional mechanisms are 
needed, however, to rank the candidates depending on the space they save and the degree to which 
they maintain important information; we do not discuss additional mechanisms of this kind. 

Information extraction systems (Grishman, 2003; Moens, 2006) often rely on manually or au- 
tomatically crafted patterns (Muslea, 1999) to locate text snippets that report particular types of 
events and to identify the entities involved; for example, patterns like (22)-(24), or similar patterns 
operating on syntax trees, possibly with additional semantic constraints, might be used to locate 
snippets referring to bombing incidents and identify their targets. Paraphrasing or textual entail- 
ment methods can be used to generate additional semantically equivalent extraction patterns (in the 
case of paraphrasing) or patterns that textually entail the original ones (Shinyama & Sekine, 2003). 

(22) X was bombed 

(23) bomb exploded near X 

(24) explosion destroyed X 

In machine translation (Koehn, 2009), ideas from paraphrasing and textual entailment research 
have been embedded in measures and processes that automatically evaluate machine-generated 
translations against human-authored ones that may use different phrasings (Lepage & Denoual, 
2005; Zhou, Lin, & Hovy, 2006a; Kauchak & Barzilay, 2006; Pado, Galley, Jurafsky, & Manning, 
2009); we return to this issue in following sections. Paraphrasing methods have also been used 
to automatically generate additional reference translations from human-authored ones when training 
machine translation systems (Madnani, Ayan, Resnik, & Dorr, 2007). Finally, paraphrasing 
and textual entailment methods have been employed to allow machine translation systems to cope 
with source language words and longer phrases that have not been encountered in training corpora 
(Zhang & Yamamoto, 2005; Callison-Burch, Koehn, & Osborne, 2006a; Marton, Callison-Burch, & 
Resnik, 2009; Mirkin, Specia, Cancedda, Dagan, Dymetman, & Szpektor, 2009b). To use an example 
of Mirkin et al. (2009b), a phrase-based machine translation system that has never encountered 
the expression "file a lawsuit" during its training, but which knows that pattern (25) textually entails 
(26), may be able to produce a more acceptable translation by converting (27) to (28), and then 
translating (28). Some information would be lost in the translation, because (28) is not a paraphrase 
of (27), but the translation may still be preferable to the outcome of translating (27) directly. 

(25) X filed a lawsuit against Y for Z. 

5 Example from Clarke et al.'s "Written News Compression Corpus" (Clarke & Lapata, 2008); see Appendix A. 

(26) X accused Y of Z. 

(27) Cisco filed a lawsuit against Apple for patent violation. 

(28) Cisco accused Apple of patent violation. 

In natural language generation (Reiter & Dale, 2000; Bateman & Zock, 2003), for example 
when producing texts describing the entities of a formal ontology (O'Donnell, Mellish, Oberlan- 
der, & Knott, 2001; Androutsopoulos, Oberlander, & Karkaletsis, 2007), paraphrasing can be used 
to avoid repeating the same phrasings (e.g., when expressing properties of similar entities), or to 
produce alternative expressions that improve text coherence, adhere to writing style (e.g., avoid 
passives), or satisfy other constraints (Power & Scott, 2005). Among other possible applications, 
paraphrasing and textual entailment methods can be employed to simplify texts, for example by re- 
placing specialized (e.g., medical) terms with expressions non-experts can understand (Elhadad & 
Sutaria, 2007; Deleger & Zweigenbaum, 2009), and to automatically score student answers against 
reference answers (Nielsen, Ward, & Martin, 2009). 

1.2 The Relation of Paraphrasing and Textual Entailment to Logical Entailment 

If we represent the meanings of natural language expressions by logical formulae, for example in 
first-order predicate logic, we may think of textual entailment and paraphrasing in terms of logical 
entailment ($\models$). If the logical meaning representations of T and H are $\phi_T$ and $\phi_H$, then (T,H) is a 
correct textual entailment pair if and only if $(\phi_T \land B) \models \phi_H$; B is a knowledge base, for simplicity 
assumed here to have the form of a single conjunctive formula, which contains meaning postulates 
(Carnap, 1952) and other knowledge assumed to be shared by all language users. 6 Let us consider 
the example below, where logical terms starting with capital letters are constants; we assume that 
different word senses would give rise to different predicate symbols. Let us also assume that B 
contains only $\psi$. Then $(\phi_T \land \psi) \models \phi_H$ holds, i.e., $\phi_H$ is true for any interpretation (e.g., model-theoretic) 
of constants, predicate names and other domain-dependent atomic symbols, for which $\phi_T$ 
and $\psi$ both hold. A sound and complete automated reasoner (e.g., based on resolution, in the case 
of first-order predicate logic) could be used to confirm that the logical entailment holds. Hence, T 
textually entails H, assuming again that the meaning postulate $\psi$ is available. The reverse, however, 
does not hold, i.e., $(\phi_H \land \psi) \not\models \phi_T$; the implication ($\Rightarrow$) of $\psi$ would have to be made bidirectional 
for the reverse to hold. 

T: Leonardo da Vinci painted the Mona Lisa. 
$\phi_T$: isPainterOf(DaVinci, MonaLisa) 
H: Mona Lisa is the work of Leonardo da Vinci. 
$\phi_H$: isWorkOf(MonaLisa, DaVinci) 
$\psi$: $\forall x \forall y \; \mathrm{isPainterOf}(x,y) \Rightarrow \mathrm{isWorkOf}(y,x)$ 

6 Zaenen et al. (2005) provide examples showing that linguistic and world knowledge cannot often be separated. 

Similarly, if the logical meaning representations of $T_1$ and $T_2$ are $\phi_1$ and $\phi_2$, then $T_1$ is a paraphrase 
of $T_2$ iff $(\phi_1 \land B) \models \phi_2$ and $(\phi_2 \land B) \models \phi_1$, where again B contains meaning postulates and 
common sense knowledge. Ideally, sentences like (1)-(3) would be represented by the same formula, 
making it clear that they are paraphrases, regardless of the contents of B. Otherwise, it may 
sometimes be unclear if $T_1$ and $T_2$ should be considered paraphrases, because it may be unclear if 
some knowledge should be considered part of B. 
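
The logic-based definitions above can be illustrated with a small Python sketch. It is deliberately restricted: formulae are sets of ground binary atoms, the knowledge base holds only single-premise rules of the kind used in the Mona Lisa example, and entailment is checked by forward chaining rather than by a general theorem prover. The names `closure`, `entails`, and `paraphrases`, and the tuple encoding of atoms and rules, are assumptions made for this illustration.

```python
def closure(facts, rules):
    """Forward-chain single-premise rules until no new ground atoms appear.

    facts: iterable of (predicate, arg1, arg2) atoms.
    rules: iterable of (body_predicate, head_predicate, swap_args) triples;
           ("isPainterOf", "isWorkOf", True) encodes
           isPainterOf(x, y) => isWorkOf(y, x).
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body_pred, head_pred, swap in rules:
            for pred, a, b in list(known):
                if pred != body_pred:
                    continue
                derived = (head_pred, b, a) if swap else (head_pred, a, b)
                if derived not in known:
                    known.add(derived)
                    changed = True
    return known


def entails(phi_t, phi_h, rules):
    """True if every atom of phi_h follows from phi_t together with the rules."""
    return set(phi_h) <= closure(phi_t, rules)


def paraphrases(phi_1, phi_2, rules):
    """Paraphrasing as bidirectional entailment."""
    return entails(phi_1, phi_2, rules) and entails(phi_2, phi_1, rules)


phi_t = [("isPainterOf", "DaVinci", "MonaLisa")]
phi_h = [("isWorkOf", "MonaLisa", "DaVinci")]
psi = [("isPainterOf", "isWorkOf", True)]

print(entails(phi_t, phi_h, psi))  # T entails H
print(entails(phi_h, phi_t, psi))  # the reverse direction
```

With only the one-directional postulate, the first check succeeds and the second fails; adding the converse rule makes the two representations mutually entailing, which mirrors the remark that the implication would have to be bidirectional.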

Since natural language expressions are often ambiguous, especially out of context, we may again 
want to adopt looser definitions, so that T textually entails H iff there are possible readings of T and 
H, represented by $\phi_T$ and $\phi_H$, such that $(\phi_T \land B) \models \phi_H$, and similarly for paraphrases. Thinking of 
textual entailment and paraphrasing in terms of logical entailment allows us to borrow notions and 
methods from logic. Indeed, some paraphrasing and textual entailment recognition methods map 
natural language expressions to logical formulae, and then examine if logical entailments hold. This 
is not, however, the only possible approach. Many other, if not most, methods currently operate on 
surface strings or syntactic representations, without mapping natural language expressions to formal 
meaning representations. Note, also, that in methods that map natural language to logical formulae, 
it is important to work with a form of logic that provides adequate support for logical entailment 
checks; full first-order predicate logic may be inappropriate, as it is semi-decidable. 

To apply our logic-based definition of textual entailment, which was formulated for statements, 
to questions, let us use identical fresh constants (in effect, Skolem constants) across questions to 
represent the unknown entities the questions ask for; we mark such constants with question marks 
as subscripts, but in logical entailment checks they can be treated as ordinary constants. In the fol- 
lowing example, the user asks H, and the system generates T. Assuming that the meaning postulate 
$\psi$ is available in B, $(\phi_T \land B) \models \phi_H$, i.e., for any interpretation of the predicate symbols and constants, 
if $(\phi_T \land B)$ is true, then $\phi_H$ is necessarily also true. Hence, T textually entails H. In practice, this 
means that if the system manages to find an answer to T, perhaps because T's phrasing is closer to 
a sentence in a document collection, the same answer can be used to respond to H. 

T (generated): Who painted the Mona Lisa? 
$\phi_T$: isAgent($W_?$) $\land$ isPainterOf($W_?$, MonaLisa) 
H (asked): Whose work is the Mona Lisa? 
$\phi_H$: isAgent($W_?$) $\land$ isWorkOf(MonaLisa, $W_?$) 
$\psi$: $\forall x \forall y \; \mathrm{isPainterOf}(x,y) \Rightarrow \mathrm{isWorkOf}(y,x)$ 

A logic-based definition of question paraphrases can be formulated in a similar manner, as 
bidirectional logical entailment. Note also that logic-based paraphrasing and textual entailment 
methods may actually represent interrogatives as free variables, instead of fresh constants, and they 
may rely on unification to obtain their values (Moldovan & Rus, 2001; Rinaldi, Dowdall, Kaljurand, 
Hess, & Molla, 2003). 

1.3 A Classification of Paraphrasing and Textual Entailment Methods 

There have been six workshops on paraphrasing and/or textual entailment (Sato & Nakagawa, 2001; 
Inui & Hermjakob, 2003; Dolan & Dagan, 2005; Dras & Yamamoto, 2005; Sekine, Inui, Dagan, 
Dolan, Giampiccolo, & Magnini, 2007; Callison-Burch, Dagan, Manning, Pennacchiotti, & Zanzotto, 
2009) in recent years. 7 The Recognizing Textual Entailment (RTE) challenges (Dagan et al., 
2006; Bar-Haim et al., 2006; Giampiccolo, Magnini, Dagan, & Dolan, 2007; Giampiccolo, Dang, 
Magnini, Dagan, & Dolan, 2008), currently in their fifth year, provide additional significant thrust. 8 
Consequently, there is a large number of published articles, proposed methods, and resources related 
to paraphrasing and textual entailment. 9 A special issue on textual entailment was also recently 
published, and its editorial provides a brief overview of textual entailment methods (Dagan, Dolan, 
Magnini, & Roth, 2009). 10 To the best of our knowledge, however, the present article is the first 
extensive survey of paraphrasing and textual entailment. 

7 The proceedings of the five more recent workshops are available in the ACL Anthology. 

To provide a clearer view of the different goals and assumptions of the methods that have been 
proposed, we classify them along two dimensions: whether they are paraphrasing or textual en- 
tailment methods; and whether they perform recognition, generation, or extraction of paraphrases 
or textual entailment pairs. These distinctions are not always clear in the literature, especially the 
distinctions along the second dimension, which we explain below. It is also possible to classify 
methods along other dimensions, for example depending on whether they operate on language ex- 
pressions or templates; or whether they operate on phrases, sentences or longer texts. 

The main input to a paraphrase or textual entailment recognizer is a pair of language expressions 
(or templates), possibly in particular contexts. The output is a judgement, possibly probabilistic, in- 
dicating whether or not the members of the input pair are paraphrases or a correct textual entailment 
pair; the judgements must agree as much as possible with those of humans. On the other hand, the 
main input to a paraphrase or textual entailment generator is a single language expression (or tem- 
plate) at a time, possibly in a particular context. The output is a set of paraphrases of the input, or 
a set of language expressions that entail or are entailed by the input; the output set must be as large 
as possible, but including as few errors as possible. In contrast, no particular language expressions 
or templates are provided to a paraphrase or textual entailment extractor. The main input in this 
case is a corpus, for example a monolingual corpus of parallel or comparable texts, such as different 
English translations of the same French novel, or clusters of multiple monolingual news articles, 
with the articles in each cluster reporting the same event. The system outputs pairs of paraphrases 
(possibly templates), or pairs of language expressions (or templates) that constitute correct textual 
entailment pairs, based on the evidence of the corpus; the goal is again to produce as many output 
pairs as possible, with as few errors as possible. Note that the boundaries between recognizers, gen- 
erators, and extractors may not always be clear. For example, a paraphrase generator may invoke a 
paraphrase recognizer to filter out erroneous candidate paraphrases; and a recognizer or a generator 
may consult a collection of template pairs produced by an extractor. 

We note that articles reporting actual applications of paraphrasing and textual entailment meth- 
ods to larger systems (e.g., for QA, information extraction, machine translation, as discussed in 
Section 1.1) are currently relatively few, compared to the number of articles that propose new para- 
phrasing and textual entailment methods or that test them in vitro, despite the fact that articles of the 
second kind very often point to possible applications of the methods they propose. The relatively 
small number of application articles may be an indicator that paraphrasing and textual entailment 
methods are not used extensively in larger systems yet. We believe that this may be due to at least 
two reasons. First, the efficiency of the methods needs to be improved, which may require com- 
bining recognition, generation, and extraction methods, for example to iteratively produce more 
training data; we return to this point in following sections. Second, the literature on paraphrasing 



8 The RTE challenges were initially organized by the European PASCAL Network of Excellence, and subsequently as 
part of NIST's Text Analysis Conference. 

9 A textual entailment portal has been established, as part of ACL's wiki, to help organize all relevant material. 
10 The slides of Dagan, Roth, and Zanzotto's ACL 2007 tutorial on textual entailment are also publicly available. 






Androutsopoulos & Malakasiotis 



and textual entailment is vast, which makes it difficult for researchers working on larger systems to 
assimilate its key concepts and identify suitable methods. We hope that this article will help address 
the second problem, while also acting as an introduction that may help new researchers improve 
paraphrasing and textual entailment methods further. 

In Sections 2, 3, and 4 below we consider in turn recognition, generation, and extraction methods 
for both paraphrasing and textual entailment. In each of the three sections, we attempt to identify 
and explain prominent ideas, pointing also to relevant articles and resources. In Section 5, we 
conclude and discuss some possible directions for future research. The URLs of all publicly available 
resources that we mention are listed in Appendix A. 

2. Paraphrase and Textual Entailment Recognition 

Paraphrase and textual entailment recognizers judge whether or not two given language expressions 
(or templates) constitute paraphrases or a correct textual entailment pair. Different methods may 
operate at different levels of representation of the input expressions; for example, they may treat the 
input expressions simply as surface strings, they may operate on syntactic or semantic representa- 
tions of the input expressions, or on representations combining information from different levels. 

2.1 Logic-based Approaches to Recognition 

One possibility is to map the language expressions to logical meaning representations, and then 
rely on logical entailment checks, possibly by invoking theorem provers (Rinaldi et al., 2003; Bos 
& Markert, 2005; Tatu & Moldovan, 2005, 2007). In the case of textual entailment, this involves 
generating pairs of formulae ($\phi_T$, $\phi_H$) for T and H (or their possible readings), and then checking 
if $(\phi_T \land B) \models \phi_H$, where B contains meaning postulates and common sense knowledge, as already 
discussed. In practice, however, it may be very difficult to formulate a reasonably complete B. A 
partial solution to this problem is to obtain common sense knowledge from resources like WordNet 
(Fellbaum, 1998) or Extended WordNet (Moldovan & Rus, 2001). The latter also includes logical 
meaning representations extracted from WordNet's glosses. For example, since "assassinate" is a 
hyponym (more specific sense) of "kill" in WordNet, an axiom like the following can be added to B 
(Moldovan & Rus, 2001; Bos & Markert, 2005; Tatu & Moldovan, 2007). 

$\forall x \forall y \; \mathrm{assassinate}(x,y) \Rightarrow \mathrm{kill}(x,y)$ 

Additional axioms can be obtained from FrameNet's frames (Baker, Fillmore, & Lowe, 1998; 
Lonneker-Rodman & Baker, 2009), as discussed for example by Tatu et al. (2005), or similar re- 
sources. Roughly speaking, a frame is the representation of a prototypical situation (e.g., a pur- 
chase), which also identifies the situation's main roles (e.g., the buyer, the entity bought), the types 
of entities (e.g., person) that can play these roles, and possibly relations (e.g., causation, inheritance) 
to other prototypical situations (other frames). VerbNet (Schuler, 2005) also specifies, among other 
information, semantic frames for English verbs. On-line encyclopedias have also been used to ob- 
tain background knowledge by extracting particular types of information (e.g., is-a relationships) 
from their articles (Iftene & Balahur-Dobrescu, 2007). 

Another approach is to use no particular B (meaning postulates and common sense knowledge), 
and measure how difficult it is to satisfy both $\phi_T$ and $\phi_H$, in the case of textual entailment recognition, 
compared to satisfying $\phi_T$ on its own. A possible measure is the difference of the size of 
the minimum model that satisfies both $\phi_T$ and $\phi_H$, compared to the size of the minimum model that 
satisfies $\phi_T$ on its own (Bos & Markert, 2005); intuitively, a model is an assignment of entities, 
relations etc. to terms, predicate names, and other domain-dependent atomic symbols. The greater 
this difference the more knowledge is required in B for $(\phi_T \land B) \models \phi_H$ to hold, and the more difficult 
it becomes for speakers to accept that T textually entails H. Similar bidirectional logical entailment 
checks can be used to recognize paraphrases (Rinaldi et al., 2003). 

2.2 Recognition Approaches that Use Vector Space Models of Semantics 

An alternative to using logical meaning representations is to start by mapping each word of the 
input language expressions to a vector that shows how strongly the word cooccurs with particular 
other words in corpora (Lin, 1998b), possibly also taking into account syntactic information, for 
example requiring that the cooccurring words participate in particular syntactic dependencies (Pado 
& Lapata, 2007). A compositional vector-based meaning representation theory can then be used to 
combine the vectors of single words, eventually mapping each one of the two input expressions to a 
single vector that attempts to capture its meaning; in the simplest case, the vector of each expression 
could be the sum or product of the vectors of its words, but more elaborate approaches have also 
been proposed (Mitchell & Lapata, 2008; Erk & Pado, 2009; Clarke, 2009). Paraphrases can then 
be detected by measuring the distance of the vectors of the two input expressions, for example by 
computing their cosine similarity. See also the work of Turney and Pantel (2010) for a survey of 
vector space models of semantics. 
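
The simplest variant described above (additive composition followed by cosine similarity) can be sketched in a few lines of Python. The word vectors below are made-up toy values over three hypothetical context dimensions, not corpus statistics, and the names `WORD_VECTORS`, `compose`, and `cosine` are assumptions of this illustration.

```python
import math

# Toy co-occurrence vectors (hypothetical values, three context dimensions).
WORD_VECTORS = {
    "doctor":    [4.0, 1.0, 0.0],
    "physician": [3.0, 1.0, 0.0],
    "treats":    [1.0, 3.0, 0.0],
    "cures":     [1.0, 2.0, 0.0],
    "guitar":    [0.0, 0.0, 5.0],
    "plays":     [0.0, 1.0, 3.0],
}


def compose(words):
    """Additive composition: the expression vector is the sum of its word vectors."""
    dims = len(next(iter(WORD_VECTORS.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(WORD_VECTORS[word]):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


e1 = compose(["doctor", "treats"])
e2 = compose(["physician", "cures"])
e3 = compose(["guitar", "plays"])
print(cosine(e1, e2), cosine(e1, e3))
```

A recognizer built on this idea would compare the similarity against a threshold tuned on annotated pairs; note that additive composition ignores word order, which is one motivation for the more elaborate composition functions cited above.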

Recognition approaches based on vector space models of semantics appear to have been ex- 
plored much less than other approaches discussed in this article, and mostly in paraphrase recogni- 
tion (Erk & Pado, 2009). They could also be used in textual entailment recognition, however, by 
checking if the vector of H is particularly close to that of a part (e.g., phrase or sentence) of T. Intu- 
itively, this would check if what H says is included in what T says, though we must be careful with 
negations and other expressions that do not preserve truth values (Zaenen et al., 2005; MacCartney 
& Manning, 2009), as in (29)-(30). We return to the idea of matching H to a part of T below. 

(29) T: He denied that BigCo bought SmallCo. 

(30) H: BigCo bought SmallCo. 

2.3 Recognition Approaches Based on Surface String Similarity 

Several paraphrase recognition methods operate directly on the input surface strings, possibly after 
applying some pre-processing, such as part-of-speech (POS) tagging or named-entity recognition, 
but without computing more elaborate syntactic or semantic representations. For example, they 
may compute the string edit distance (Levenshtein, 1966) of the two input strings, the number of 
their common words, or combinations of several string similarity measures (Malakasiotis & An- 
droutsopoulos, 2007), including measures originating from machine translation evaluation (Finch, 
Hwang, & Sumita, 2005; Perez & Alfonseca, 2005; Zhang & Patrick, 2005; Wan, Dras, Dale, & 
Paris, 2006). The latter have been developed to automatically compare machine-generated
translations against human-authored reference translations. A well known measure is BLEU (Papineni,
Roukos, Ward, & Zhu, 2002; Zhou et al., 2006a), which roughly speaking examines the percentage
of word n-grams (sequences of consecutive words) of the machine-generated translations that also
occur in the reference translations, and takes the geometric average of the percentages obtained for
different values of n. Although such n-gram based measures have been criticised in machine
translation evaluation (Callison-Burch, Osborne, & Koehn, 2006b), for example because they are unaware
of synonyms and longer paraphrases, they can be combined with other measures to build para- 
phrase (and textual entailment) recognizers (Zhou et al., 2006a; Kauchak & Barzilay, 2006; Pado 
et al., 2009), which may help address the problems of automated machine translation evaluation. 
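
The n-gram comparison described above can be sketched as follows. This is a simplified, BLEU-like score and not the full measure: BLEU proper also applies a brevity penalty and supports multiple reference translations, both of which are omitted here. The function names and the choice of max_n=2 are assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the word n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Fraction of the candidate's n-grams that also occur in the reference,
    with counts clipped so that a repeated n-gram is not rewarded twice."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def bleu_like(candidate, reference, max_n=2):
    """Geometric mean of the n-gram precisions for n = 1..max_n."""
    precisions = [ngram_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


s1 = "wonderworks ltd. constructed the new bridge".split()
s2 = "the new bridge was constructed by wonderworks ltd.".split()
score = bleu_like(s1, s2)
```

For sentences (1) and (2) of the introduction, all unigrams of the first sentence occur in the second, but only three of its five bigrams do, so the score is the square root of 0.6 (about 0.77) despite the two sentences being paraphrases. This illustrates why such measures are usually combined with others.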

In textual entailment recognition, one of the input language expressions (T) is often much longer 
than the other one (H). If a part of T's surface string is very similar to H's, this is an indication 
that H may be entailed by T. This is illustrated in (31)-(32), where H is included verbatim in T. 11
Note, however, that the surface string similarity (e.g., measured by string edit distance) between H 
and the entire T of this example is low, because of the different lengths of T and H. 

(31) T: Charles de Gaulle died in 1970 at the age of eighty. He was thus fifty years old when, as an un- 
known officer recently promoted to the rank of brigadier general, he made his famous broadcast from 
London rejecting the capitulation of France to the Nazis after the debacle of May-June 1940. 

(32) H: Charles de Gaulle died in 1970. 

Comparing H to a sliding window of T's surface string of the same size as H (in our example, six 
consecutive words of T) and keeping the largest similarity score between the sliding window and 
H may provide a better indication of whether T entails H or not (Malakasiotis, 2009). In many 
correct textual entailment pairs, however, using a single sliding window of a fixed length may still 
be inadequate, because H may correspond to several non-continuous parts of T; in (33)-(34), for 
example, H corresponds to the three underlined parts of T. 12

(33) T: The Gaspe, also known as la Gaspesie in French, is a North American peninsula on the south shore
of the Saint Lawrence River, in Quebec. 

(34) H: The Gaspe is a peninsula in Quebec. 
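
The sliding window idea can be sketched in a few lines of Python. The sketch below uses a word-level edit distance, normalized by the length of the longer sequence, as the similarity measure; this normalization, the simple tokenizer, and all function names are assumptions of the example. T is abbreviated from (31).

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,             # delete x
                               current[j - 1] + 1,          # insert y
                               previous[j - 1] + (x != y)))  # substitute
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical token sequences."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def best_window_similarity(t, h):
    """Largest similarity between h and any window of t with len(h) words."""
    size = len(h)
    if len(t) <= size:
        return similarity(t, h)
    return max(similarity(t[i:i + size], h)
               for i in range(len(t) - size + 1))


def tokens(text):
    """Crude tokenizer: lowercase, drop periods and commas, split on spaces."""
    return text.lower().replace(".", " ").replace(",", " ").split()


t = tokens("Charles de Gaulle died in 1970 at the age of eighty. He was thus "
           "fifty years old when he made his famous broadcast from London.")
h = tokens("Charles de Gaulle died in 1970.")
whole = similarity(t, h)
windowed = best_window_similarity(t, h)
```

Here the similarity between H and the whole of T is low, whereas the best six-word window matches H exactly. As discussed above, the same window would fail on (33)-(34), where the matching parts of T are not contiguous.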

One possible solution is to attempt to align the words (or phrases) of H to those of T, and 
consider T-H a correct textual entailment pair if a sufficiently good alignment is found, in the 
simplest case if a large percentage of T's words are aligned to words of H. Another approach would
be to use a window of variable length; the window could be, for example, the shortest span of T
that contains all of T's words that are aligned to words of H (Burchardt, Pennacchiotti, Thater, &
Pinkal, 2009). In any case, we need to be careful with negations and other expressions that do 
not preserve truth values, as already mentioned. Note, also, that although effective word alignment 
methods have been developed in statistical machine translation (Brown, Della Pietra, Della Pietra,
& Mercer, 1993; Vogel, Ney, & Tillmann, 1996; Och & Ney, 2003), they often perform poorly on 
textual entailment pairs, because T and H are often of very different lengths, they do not necessarily 
convey the same information, and textual entailment training datasets are much smaller than those 
used in machine translation; see MacCartney et al.'s (2008) work for further related discussion and 
a word alignment method developed especially for textual entailment pairs. 13 
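
A variable-length window of the kind mentioned above can be found with a standard two-pointer scan. The sketch below assumes the crudest possible alignment, namely exact matches between lowercased words, and returns the shortest span of T that contains every word of H that occurs in T at all; the function name and the alignment criterion are assumptions of the example, not the method of the cited work.

```python
from collections import Counter


def shortest_covering_span(t, h):
    """Return (start, end) such that t[start:end] is the shortest span of t
    containing every word of h that occurs somewhere in t, or None if no
    word of h occurs in t."""
    t_words = set(t)
    needed = {word for word in h if word in t_words}
    if not needed:
        return None
    counts = Counter()
    covered = 0
    best = None
    left = 0
    for right, word in enumerate(t):
        if word in needed:
            counts[word] += 1
            if counts[word] == 1:
                covered += 1
        # Shrink the window from the left while it still covers all words.
        while covered == len(needed):
            if best is None or right + 1 - left < best[1] - best[0]:
                best = (left, right + 1)
            leaving = t[left]
            if leaving in needed:
                counts[leaving] -= 1
                if counts[leaving] == 0:
                    covered -= 1
            left += 1
    return best


t = ("the gaspe also known as la gaspesie in french is a north american "
     "peninsula on the south shore of the saint lawrence river in quebec").split()
h = "the gaspe is a peninsula in quebec".split()
span = shortest_covering_span(t, h)
```

For (33)-(34) the span covers almost all of T, because "Gaspe" occurs near the beginning and "Quebec" at the end; a similarity score would then have to be computed over the aligned words inside the span rather than over the span as a whole.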

2.4 Recognition Approaches Based on Syntactic Similarity 

Another common approach is to work at the syntax level. Dependency grammar parsers (Melcuk, 
1987; Kubler, McDonald, & Nivre, 2009) are popular in paraphrasing and textual entailment research,

11 Example from the dataset of RTE-3 (Giampiccolo et al., 2007).
12 Modified example from the dataset of RTE-3 (Giampiccolo et al., 2007).

13 Cohn et al. (2008) discuss how a publicly available corpus with manually word-aligned paraphrases was constructed.
Other word-aligned paraphrasing or textual entailment datasets can be found at the ACL Textual Entailment Portal.

[Figure 1 dependency trees not reproduced; the passive (right) sentence is "The problem was solved by a young mathematician".]
Figure 1: Two sentences that are very similar when viewed at the level of dependency trees.

as in other natural language processing areas recently. Instead of showing hierarchically the
syntactic constituents (e.g., noun phrases, verb phrases) of a sentence, the output of a dependency 
grammar parser is a graph (usually a tree) whose nodes are the words of the sentence and whose 
(labeled) edges correspond to syntactic dependencies between words, for example the dependency 
between a verb and the head noun of its subject noun phrase, or the dependency between a noun 
and an adjective that modifies it. Figure 1 shows the dependency trees of two sentences. The exact 
form of the trees and the edge labels would differ, depending on the parser; for simplicity, we show 
prepositions as edges. If we ignore word order and the auxiliary "was" of the passive (right) sen- 
tence, and if we take into account that the by edge of the passive sentence corresponds to the subj 
edge of the active (left) one, the only difference is the extra adjective of the passive sentence. Hence, 
it is easy to figure out from the dependency trees that the two sentences have very similar meanings, 
despite their differences in word order. Strictly speaking, the right sentence textually entails the left 
one, not the reverse, because of the word "young" in the right sentence. 

Some paraphrase recognizers simply count the common edges of the dependency trees of the 
input expressions (Wan et al., 2006; Malakasiotis, 2009) or use other tree similarity measures. A 
large similarity score (e.g., above a threshold) indicates that the input expressions may be para- 
phrases. Tree edit distance (Selkow, 1977; Tai, 1979; Zhang & Shasha, 1989) is another example 
of a similarity measure that can be applied to dependency or other parse trees; it computes the se- 
quence of operator applications (e.g., add, replace, or remove a node or edge) with the minimum 
cost that turns one tree into the other. 14 To obtain more accurate predictions, it is important to devise 
an appropriate inventory of operators and assign appropriate costs to the operators during a training 
stage (Kouylekov & Magnini, 2005; Mehdad, 2009; Harmeling, 2009). For example, replacing a 
noun with one of its synonyms should be less costly than replacing it with an unrelated word; and 
removing a dependency between a verb and an adverb should perhaps be less costly than removing 
a dependency between a verb and the head noun of its subject or object. 
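
Counting common dependency edges can be sketched as follows. The edges are written by hand as (head, label, dependent) triples rather than produced by a parser; the edge labels, the active sentence "A mathematician solved the problem", and the normalization rules are assumptions of the example, chosen to mirror the discussion of Figure 1.

```python
# Hand-written dependency edges (head, label, dependent); labels are hypothetical.
ACTIVE = {   # "A mathematician solved the problem"
    ("solved", "subj", "mathematician"),
    ("solved", "obj", "problem"),
    ("mathematician", "det", "a"),
    ("problem", "det", "the"),
}
PASSIVE = {  # "The problem was solved by a young mathematician"
    ("solved", "subj_pass", "problem"),
    ("solved", "aux", "was"),
    ("solved", "by", "mathematician"),
    ("mathematician", "mod", "young"),
    ("mathematician", "det", "a"),
    ("problem", "det", "the"),
}

# Map passive-voice labels to their active counterparts.
LABEL_MAP = {"by": "subj", "subj_pass": "obj"}


def normalize(edges):
    """Drop auxiliaries and map passive labels to active ones."""
    return {(head, LABEL_MAP.get(label, label), dep)
            for head, label, dep in edges if label != "aux"}


def edge_overlap(edges1, edges2):
    """Jaccard overlap of the normalized edge sets (ignores word order)."""
    a, b = normalize(edges1), normalize(edges2)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


overlap = edge_overlap(ACTIVE, PASSIVE)
```

After normalization the two trees share four of five distinct edges, the only unmatched edge being the adjective "young". A tree edit distance would express the same difference as the cost of a single node insertion or removal.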

In textual entailment recognition, one may compare H's parse tree against subtrees of T's parse
tree (Iftene & Balahur-Dobrescu, 2007; Zanzotto, Pennacchiotti, & Moschitti, 2009). It may be 
possible to match H's tree against a single subtree of T, in effect a single syntactic window on T, 
as illustrated in Figure 2, which shows the dependency trees of (33)-(34); recall that (34) does not 
match a single window of (33) at the surface string level. 15 This is also a further example of how 
operating at a higher level than surface strings may reveal similarities that may be less clear at lower 



14 EDITS, a suite to recognize textual entailment by computing edit distances, is publicly available.

15 Figure 2 is based on the output of Stanford's parser. One might argue that "North" should modify "American".

Figure 2: An example of how dependency trees may make it easier to match a short sentence (subtree
inside the dashed line) to a part of a longer one.



levels. Another example is (35)-(36); although (35) includes verbatim (36), it does not textually 
entail (36). 16 This is clear when one compares the syntactic representations of the two sentences: 
Israel is the subject of "was established" in (36), but not in (35). The difference, however, is not 
evident at the surface string level, and a sliding window of (35) would match exactly (36), wrongly 
suggesting a textual entailment. 

(35) T: The National Institute for Psychobiology in Israel was established in 1979. 

(36) H: Israel was established in 1979. 

Similar arguments can be made in favour of computing similarities at the semantic level (Qiu, 
Kan, & Chua, 2006); for example, both the active and passive forms of a sentence may be mapped 
to the same logical formula, making their similarity clearer than at the surface or syntax level. The 
syntactic or semantic representations of the input expressions, however, cannot always be computed 
accurately (e.g., due to parser errors), which may introduce noise; and, possibly because of the 
noise, methods that operate at the syntactic or semantic level do not necessarily outperform in 
practice methods that operate on surface strings (Wan et al., 2006; Burchardt, Reiter, Thater, & 
Frank, 2007; Burchardt et al., 2009). 

2.5 Recognition via Similarity Measures Operating on Symbolic Meaning Representations 

Paraphrases may also be recognized by computing similarity measures on graphs whose edges do 
not correspond to syntactic dependencies, but reflect semantic relations mentioned in the input ex- 
pressions (Haghighi, 2005), for example the relation between a buyer and the entity bought. Rela- 
tions of this kind may be identified by applying semantic role labeling methods (Marquez, Carreras, 

16 Modified example from Haghighi et al.'s (2005) work. 
Litkowski, & Stevenson, 2008) to the input language expressions. It is also possible to compute
similarities between meaning representations that are based on FrameNet's frames (Burchardt et al., 
2007). The latter approach has the advantage that semantically related expressions may invoke the 
same frame (as with "announcement", "announce", "acknowledge") or interconnected frames (e.g., 
FrameNet links the frame invoked by "arrest" to the frame invoked by "trial" via a path of tempo- 
ral precedence relations), making similarities and implications easier to capture (Burchardt et al., 
2009). 17 The prototypical semantic roles that PropBank (Palmer, Gildea, & Kingsbury, 2005) asso- 
ciates with each verb may also be used in a similar manner, instead of FrameNet's frames. Similarly, 
in the case of textual entailment recognition, one may compare H's semantic representation (e.g.,
semantic graph or frame) to parts of T's representation.

WordNet (Fellbaum, 1998), automatically constructed collections of near synonyms (Lin, 1998a; 
Moore, 2001; Brockett & Dolan, 2005), or resources like NOMLEX (Meyers, Macleod, Yangarber, 
Grishman, Barrett, & Reeves, 1998) and CatVar (Habash & Dorr, 2003) that provide nominaliza- 
tions of verbs and other derivationally related words across different POS categories (e.g., "to invent" 
and "invention"), can be used to match synonyms, hypernyms-hyponyms, or, more generally, se- 
mantically related words across the two input expressions. According to WordNet, in (37)-(38) 
"shares" is a direct hyponym (more specific meaning) of "stock", "slumped" is a direct hyponym 
of "dropped", and "company" is an indirect hyponym (two levels down) of "organization". 18 By 
treating semantically similar words (e.g., synonyms, or hypernyms-hyponyms up to a small hierarchical
distance) as identical (Rinaldi et al., 2003; Finch et al., 2005; Tatu, Iles, Slavick, Novischi, &
Moldovan, 2006; Iftene & Balahur-Dobrescu, 2007; Malakasiotis, 2009; Harmeling, 2009), or by 
considering (e.g., counting) semantically similar words across the two input language expressions 
(Brockett & Dolan, 2005; Bos & Markert, 2005), paraphrase recognizers may be able to cope with 
paraphrases that have very similar meanings, but very few or no common words. 

(37) The shares of the company dropped. 

(38) The organization's stock slumped. 

In textual entailment recognition, it may be desirable to allow the words of T to be more distant 
hyponyms of the words of H, compared to paraphrase recognition. For example, "X is a computer" 
textually entails "X is an artifact", and "computer" is a hyponym of "artifact" four levels down. 
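
The role of hierarchical distance can be illustrated with a toy hypernym table. The table below is a hand-made fragment that merely imitates the WordNet relations mentioned in the text (it is not read from WordNet, and the intermediate nodes are assumptions); levels_up, related, and entails_word are names invented for the example.

```python
# child -> parent (hypernym); a hand-made fragment for illustration only.
HYPERNYM = {
    "shares": "stock",
    "slumped": "dropped",
    "company": "institution",
    "institution": "organization",
    "computer": "machine",
    "machine": "device",
    "device": "instrumentality",
    "instrumentality": "artifact",
}


def levels_up(word, ancestor, max_levels):
    """Number of hypernym steps from word up to ancestor, or None if the
    ancestor is not reached within max_levels steps."""
    steps, current = 0, word
    while current != ancestor:
        if current not in HYPERNYM or steps == max_levels:
            return None
        current = HYPERNYM[current]
        steps += 1
    return steps


def related(word1, word2, max_levels=2):
    """Symmetric test for paraphrase recognition: either word may be the
    more general one, within a small hierarchical distance."""
    return (levels_up(word1, word2, max_levels) is not None
            or levels_up(word2, word1, max_levels) is not None)


def entails_word(t_word, h_word, max_levels=4):
    """Directional test for entailment: the word of T must be the same as,
    or a (possibly distant) hyponym of, the word of H."""
    return levels_up(t_word, h_word, max_levels) is not None


pairs_37_38 = [("shares", "stock"), ("dropped", "slumped"),
               ("company", "organization")]
all_related = all(related(w1, w2) for w1, w2 in pairs_37_38)
```

With a distance limit of two levels, all three word pairs of (37)-(38) are treated as matching, whereas "computer" and "artifact" match only under the more permissive, directional test used for entailment.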

Measures that exploit WordNet (or similar resources) and compute the semantic similarity 
between two words or, more generally, two texts have also been proposed (Leacock, Miller, & 
Chodorow, 1998; Lin, 1998c; Resnik, 1999; Budanitsky & Hirst, 2006; Tsatsaronis, Varlamis, & 
Vazirgiannis, 2010). 19 Some of them are directional, making them more suitable to textual entail- 
ment recognition (Corley & Mihalcea, 2005). Roughly speaking, measures of this kind consider 
(e.g., sum the lengths of) the paths in WordNet's hierarchies (or similar resources) that connect the 
senses of corresponding (e.g., most similar) words across the two texts. They may also take into 
account information such as the frequencies of the words in the two texts and how rarely they are 
encountered in documents of a large collection (inverse document frequency). The rationale is that 
frequent words of the input texts that are rarely used in a general corpus are more important, as in 

17 Consult, for example, the work of Erk and Pado (2006) for a description of a system that can annotate texts with 
FrameNet frames. The FATE corpus (Burchardt & Pennacchiotti, 2008), a version of the RTE 2 test set (Bar-Haim et al., 
2006) with FrameNet annotations, is publicly available. 

18 Modified example from the work of Tsatsaronis (2009).

19 Pedersen's WordNet::Similarity package implements many of these measures.

Figure 3: Paraphrase and textual entailment recognition via supervised machine learning.



information retrieval; hence, the paths that connect them should be assigned greater weights. Since 
they often consider paths between word senses, many of these measures would ideally be combined 
with word sense disambiguation (Yarowsky, 2000; Stevenson & Wilks, 2003; Kohomban & Lee,
2005; Navigli, 2008), which is not, however, always accurate enough for practical purposes. 

2.6 Recognition Approaches that Employ Machine Learning 

Multiple similarity measures, possibly computed at different levels (surface strings, syntactic or 
semantic representations) may be combined by using machine learning (Mitchell, 1997; Alpaydin,
2004), as illustrated in Figure 3. 20 Each pair of input language expressions (P1, P2), i.e., each
pair of expressions that we wish to check for being paraphrases or a correct textual entailment pair,
is represented by a feature vector (f1, ..., fm). The vector contains the scores of multiple similarity
measures applied to the pair, and possibly other features. For example, many systems also
include features that check for polarity differences across the two input expressions, as in "this is 
not a confirmed case of rabies" vs. "a case of rabies was confirmed", or modality differences, as 
in "a case may have been confirmed" vs. "a case has been confirmed" (Haghighi, 2005; Iftene & 
Balahur-Dobrescu, 2007; Tatu & Moldovan, 2007). Bos and Markert (2005) also include features 
indicating if a theorem prover has managed to prove that the logical representation of one of the 
input expressions entails the other or contradicts it. A supervised machine learning algorithm trains 
a classifier on manually classified (as correct or incorrect) vectors corresponding to training input 
pairs. Once trained, the classifier can classify unseen pairs as correct or incorrect paraphrases or 
textual entailment pairs by examining their features (Bos & Markert, 2005; Brockett & Dolan, 2005; 
Zhang & Patrick, 2005; Finch et al., 2005; Wan et al., 2006; Burchardt et al., 2007; Hickl, 2008; 
Malakasiotis, 2009; Nielsen et al., 2009).
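
The pipeline of Figure 3 can be sketched with a deliberately tiny example. Each pair is mapped to two features (word overlap and a polarity mismatch flag), and a perceptron stands in for the learning algorithm; the systems cited above use far richer features and learners such as SVMs or Maximum Entropy classifiers. The four training pairs, their labels, and all names below are invented for illustration.

```python
NEGATIONS = {"not", "no", "never"}


def features(p1, p2):
    """Feature vector of a pair: [Jaccard word overlap, polarity mismatch]."""
    a, b = set(p1.lower().split()), set(p2.lower().split())
    jaccard = len(a & b) / len(a | b)
    mismatch = float(bool(a & NEGATIONS) != bool(b & NEGATIONS))
    return [jaccard, mismatch]


def train_perceptron(examples, epochs=1000):
    """Train on (feature_vector, label) examples, with labels +1 or -1."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in examples:
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:          # misclassified: move the boundary
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
                errors += 1
        if errors == 0:                 # converged on the training data
            break
    return weights, bias


def classify(model, p1, p2):
    """True if the trained model judges (p1, p2) a positive pair."""
    weights, bias = model
    x = features(p1, p2)
    return sum(w * v for w, v in zip(weights, x)) + bias > 0


# Invented training pairs; +1 = positive pair, -1 = negative pair.
TRAIN = [
    (("a case of rabies was confirmed",
      "a case of rabies has been confirmed"), +1),
    (("this is not a confirmed case of rabies",
      "a case of rabies was confirmed"), -1),
    (("the shares dropped", "the bridge was built"), -1),
    (("kids like sweets", "kids like candies"), +1),
]
model = train_perceptron([(features(p1, p2), y) for (p1, p2), y in TRAIN])
```

Note that the second training pair has high word overlap but is negative; it is the polarity feature that allows the classifier to separate it from the positive pairs, which is the motivation for including such features alongside similarity scores.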

A preprocessing stage is commonly applied to each input pair of language expressions, before 
converting it to a feature vector (Zhang & Patrick, 2005). Part of the preprocessing may provide 



20 WEKA (Witten & Frank, 2005) provides implementations of several well known machine learning algorithms, in-
cluding C4.5 (Quinlan, 1993), Naive Bayes (Mitchell, 1997), SVMs (Vapnik, 1998; Cristianini & Shawe-Taylor, 2000; 
Joachims, 2002), and AdaBoost (Freund & Schapire, 1995; Friedman, Hastie, & Tibshirani, 2000). More efficient im- 
plementations of SVMs, such as LIBSVM and SVM-LIGHT, are also available. Maximum Entropy classifiers are also very 
effective; see chapter 6 of the book "Speech and Language Processing" (Jurafsky & Martin, 2008) for an introduction; 
Stanford's implementation is frequently used. 
information that is required to compute the features; for example, this is when a POS tagger or a
parser would be applied. 21 The preprocessing may also normalize the input pairs; for example, a 
stemmer may be applied; dates may be converted to a consistent format; names of persons, organi- 
zations, locations etc. may be tagged by their semantic categories using a named entity recognizer; 
pronouns or, more generally, referring expressions, may be replaced by the expressions they refer to 
(Hobbs, 1986; Lappin & Leass, 1994; Mitkov, 2003; Molla, Schwitter, Rinaldi, Dowdall, & Hess, 
2003; Yang, Su, & Tan, 2008); and morphosyntactic variations may be normalized (e.g., passive 
sentences may be converted to active ones). 22 

Instead of mapping each (P1, P2) pair to a feature vector that contains mostly scores measuring
the similarity between P1 and P2, it is possible to use vectors that directly encode parts of P1
and P2, or parts of their syntactic or semantic representations. Zanzotto et al. (2009) project each
(P1, P2) pair to a vector that, roughly speaking, contains as features all the fragments of the parse
trees of P1 and P2. Leaf nodes corresponding to identical or very similar words (according to a
WordNet-based similarity measure) across P1 and P2 are replaced by co-indexed slots, to allow the
features to be more general. Zanzotto et al. define a measure (actually, different versions of it) that,
in effect, computes the similarity of two pairs (P1, P2) and (P1', P2') by counting the parse tree
fragments (features) that are shared by P1 and P1', and those shared by P2 and P2'. The measure is
used as a kernel in a Support Vector Machine (SVM) that learns to separate positive textual entailment
pairs (P1, P2) = (T, H) from negative ones. A (valid) kernel can be thought of as a similarity measure
that projects two objects to a high-dimensional vector space, where it computes the inner product of
the projected objects; efficient kernels compute the inner product directly from the original objects,
without computing their projections to the high-dimensional vector space (Vapnik, 1998; Cristianini
& Shawe-Taylor, 2000; Joachims, 2002). In Zanzotto et al.'s work, each object is a (T, H)
pair, and its projection is the vector that contains all the parse tree fragments of T and H as features.
Consult, for example, the work of Zanzotto and Dell'Arciprete (2009) and Moschitti (2009) for
further discussion of kernels that can be used in paraphrase and textual entailment recognition. 
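
The relation between a kernel and an explicit projection can be illustrated with a much simpler kernel than the one described above. In the sketch below, word unigrams and bigrams stand in for parse tree fragments, and no co-indexed slots are used; it is therefore not Zanzotto et al.'s kernel, only an example of a kernel over pairs that counts shared fragments. All names are assumptions of the example.

```python
def fragments(sentence, max_n=2):
    """Set of word n-grams (n = 1..max_n), standing in for tree fragments."""
    toks = sentence.lower().split()
    return {tuple(toks[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(toks) - n + 1)}


def fragment_kernel(a, b):
    """Number of fragments shared by two sentences."""
    return len(fragments(a) & fragments(b))


def pair_kernel(pair_a, pair_b):
    """Similarity of two (T, H) pairs: shared fragments of the two T's plus
    shared fragments of the two H's, computed without explicit vectors."""
    (t1, h1), (t2, h2) = pair_a, pair_b
    return fragment_kernel(t1, t2) + fragment_kernel(h1, h2)


def explicit_vector(pair, vocabulary):
    """Explicit projection: one binary feature per fragment, for T and for H."""
    t, h = pair
    ft, fh = fragments(t), fragments(h)
    return ([int(f in ft) for f in vocabulary]
            + [int(f in fh) for f in vocabulary])


pair_a = ("Children are fond of sweets", "Kids like candies")
pair_b = ("Kids are fond of sweets", "Kids like sweets")
vocabulary = sorted(fragments(pair_a[0]) | fragments(pair_a[1])
                    | fragments(pair_b[0]) | fragments(pair_b[1]))
inner_product = sum(x * y for x, y in zip(explicit_vector(pair_a, vocabulary),
                                          explicit_vector(pair_b, vocabulary)))
kernel_value = pair_kernel(pair_a, pair_b)
```

The kernel value equals the inner product of the explicit vectors, but it is obtained from set intersections alone. With tree fragments the explicit vectors would be far too large to build, which is why computing the inner product directly matters.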

2.7 Recognition Approaches Based on Decoding 

Pairs of paraphrasing or textual entailment expressions (or templates) like (39), often called rules, 
that may have been produced by extraction mechanisms (to be discussed in Section 4) can be used 
by recognizers much as, and often in addition to, synonyms and hypernyms-hyponyms.

(39) "X is fond of Y" ≈ "X likes Y"

Given the paraphrasing rule of (39) and the information that "child" is a synonym of "kid" and 
"candy" a hyponym of "sweet", a recognizer could figure out that (40) textually entails (43) by 
gradually transforming (40) to (43) as shown below. 23 

(40) Children are fond of sweets. 

21 Brill's (1992) POS tagger is well-known and publicly available. Stanford's tagger (Toutanova, Klein, Manning, & 
Singer, 2003) is another example of a publicly available POS tagger. Commonly used parsers include Charniak's (2000), 
Collins' (2003), the Link Grammar Parser (Sleator & Temperley, 1993), MINIPAR, a principle-based parser (Berwick,
1991) very similar to PRINCIPAR (Lin, 1994), MaltParser (Nivre, Hall, Nilsson, Chanev, Eryigit, Kuebler, Marinov, & 
Marsi, 2007), and Stanford's parser (Klein & Manning, 2003). 

22 Porter's stemmer (1997) is well-known. An example of a publicly available named-entity recognizer is Stanford's. 

23 Modified example from Bar-Haim et al.'s (2009) work. 
(41) Kids are fond of sweets.

(42) Kids like sweets. 

(43) Kids like candies. 

Another recognition approach, then, is to search for a sequence of rule applications or other 
transformations (e.g., replacing synonyms, or hypernyms-hyponyms) that turns one of the input 
expressions (or its syntactic or semantic representation) into the other. We call this search decoding,
because it is similar to the decoding stage of machine translation (to be discussed in Section 3), 
where a sequence of transformations that turns a source-language expression into a target-language 
expression is sought. In our case, if a sequence is found, the two input expressions constitute a 
positive paraphrasing or textual entailment pair, depending on the rules used; otherwise, the pair is 
negative. If each rule is associated with a confidence score (possibly learnt from a training dataset) 
that reflects the degree to which the rule preserves the original meaning in paraphrase recognition, 
or the degree to which we are confident that it produces an entailed expression, we may search for 
the sequence of transformations with the maximum score (or minimum cost), much as in approaches 
that compute the minimum (string or tree) edit distance between the two input expressions. The pair 
of input expressions can then be classified as positive if the maximum-score sequence exceeds a 
confidence threshold (Harmeling, 2009). One would also have to consider the contexts where rules 
are applied, because a rule may not be valid in all contexts, for instance because of the different 
possible senses of the words it involves. A possible solution is to associate each rule with a vector 
that represents the contexts where it can be used (e.g., a vector of frequently occurring words in 
training contexts where the rule applies), and use a rule only in contexts that are similar to its 
associated context vector; with slotted rules, one can also model the types of slot values (e.g., types
of named entities) the rule can be used with; see the work of Pantel, Bhagat, Coppola, Chklovski, and
Hovy (2007), and Szpektor, Dagan, Bar-Haim, and Goldberger (2008).
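
The search for a maximum-score sequence of transformations can be sketched as a best-first search over token sequences. The three rules below are ground instances corresponding to the steps (40)-(43); their confidence scores are invented, rule contexts are ignored, and the score of a sequence is assumed to be the product of the confidences of the rules applied.

```python
import heapq

# (left-hand side, right-hand side, confidence); confidences are invented.
RULES = [
    (("children",), ("kids",), 0.9),           # synonym
    (("are", "fond", "of"), ("like",), 0.8),   # instance of rule (39)
    (("sweets",), ("candies",), 0.7),          # hypernym to hyponym, as in (43)
]


def successors(state):
    """Yield every (new_state, confidence) reachable by one rule application."""
    for lhs, rhs, confidence in RULES:
        n = len(lhs)
        for i in range(len(state) - n + 1):
            if state[i:i + n] == lhs:
                yield state[:i] + rhs + state[i + n:], confidence


def best_derivation(source, target, max_states=6):
    """Return (score, path) for the highest-scoring sequence of rule
    applications turning source into target, or None if none is found.
    Paths with max_states states or more are not extended further."""
    start = tuple(source.lower().split())
    goal = tuple(target.lower().split())
    heap = [(-1.0, [start])]            # scores negated: heapq is a min-heap
    expanded = set()
    while heap:
        neg_score, path = heapq.heappop(heap)
        state = path[-1]
        if state == goal:
            return -neg_score, path
        if state in expanded or len(path) >= max_states:
            continue
        expanded.add(state)
        for next_state, confidence in successors(state):
            heapq.heappush(heap, (neg_score * confidence, path + [next_state]))
    return None


result = best_derivation("Children are fond of sweets", "Kids like candies")
reverse = best_derivation("Kids like candies", "Children are fond of sweets")
```

The forward search finds a three-step derivation with score 0.9 x 0.8 x 0.7 = 0.504, which would be compared against a confidence threshold. The reverse search finds nothing, because the rules are applied in one direction only, as is appropriate for textual entailment. Because confidences do not exceed 1, scores never increase along a path, so the first derivation popped from the queue has the maximum score.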

Resources like WordNet and extraction methods, however, provide thousands or millions of rules,
giving rise to an exponentially large number of transformation sequences to consider. 24 When 
operating at the level of semantic representations, the sequence sought is in effect a proof that the 
two input expressions are paraphrases or a valid textual entailment pair, and it may be obtained 
by exploiting theorem provers, as discussed earlier. Bar-Haim et al. (2007) discuss how to search 
for sequences of transformations, seen as proofs at the syntactic level, when the input language 
expressions and their reformulations are represented by dependency trees. In subsequent work (Bar-Haim
et al., 2009), they introduce compact forests, a data structure that allows the dependency trees
of multiple intermediate reformulations to be represented by a single graph, to make the search 
more efficient. They also combine their approach with an SVM-based recognizer; sequences of 
transformations are used to bring T closer to H, and the SVM recognizer is then employed to judge 
if the transformed T and H constitute a positive textual entailment pair or not.

2.8 Evaluating Recognition Methods 

Experimenting with paraphrase and textual entailment recognizers requires datasets containing both
positive and negative input pairs. When using discriminative classifiers (e.g., SVMs), the negative 
training pairs must ideally be near misses, otherwise they may be of little use (Schohn & Cohn, 
2000; Tong & Koller, 2002). Near misses can also make the test data more challenging.

24 Collections of transformation rules and resources that can be used to obtain such rules are listed at the ACL Textual 
Entailment Portal. Mirkin et al. (2009a) discuss how to evaluate collections of textual entailment rules. 






method                      accuracy (%)   precision (%)   recall (%)   F-measure (%)
Corley & Mihalcea (2005)    71.5           72.3            92.5         81.2
Das & Smith (2009)          76.1           79.6            86.1         82.9
Finch et al. (2005)         75.0           76.6            89.8         82.7
Malakasiotis (2009)         76.2           79.4            86.8         82.9
Qiu et al. (2006)           72.0           72.5            93.4         81.6
Wan et al. (2006)           75.6           77.0            90.0         83.0
Zhang & Patrick (2005)      71.9           74.3            88.2         80.7
BASE1                       66.5           66.5            100.0        79.9
BASE2                       69.0           72.4            86.3         78.8



Table 1: Paraphrase recognition results on the MSR corpus. 



The most widely used benchmark dataset for paraphrase recognition is the Microsoft Research 
(MSR) Paraphrase Corpus. It contains 5,801 pairs of sentences obtained from clusters of online news 
articles referring to the same events (Dolan, Quirk, & Brockett, 2004; Dolan & Brockett, 2005). 
The pairs were initially filtered by heuristics, which require, for example, the word edit distance 
of the two sentences in each pair to be neither too small (to avoid nearly identical sentences) nor 
too large (to avoid too many negative pairs); and both sentences to be among the first three of 
articles from the same cluster (articles referring to the same event), the rationale being that initial 
sentences often summarize the events. The candidate paraphrase pairs were then filtered by an 
SVM-based paraphrase recognizer (Brockett & Dolan, 2005), trained on separate manually classified
pairs obtained in a similar manner, which was biased to overidentify paraphrases. Finally, human 
judges annotated the remaining sentence pairs as paraphrases or not. After resolving disagreements, 
approximately 67% of the 5,801 pairs were judged to be paraphrases. The dataset is divided into
two non-overlapping parts, for training (approximately 70% of all pairs) and testing (approximately 30%). Zhang and Patrick
(2005) and others have pointed out that the heuristics that were used to construct the corpus may 
have biased it towards particular types of paraphrases, excluding for example paraphrases that do 
not share any common words. 

Table 1 lists all the published results of paraphrase recognition experiments on the MSR corpus 
we are aware of. We include two baselines we used: BASE1 classifies all pairs as paraphrases;
BASE2 classifies two sentences as paraphrases when their surface word edit distance is below a
threshold, tuned on the training part of the corpus. Four commonly used evaluation measures are
reported: accuracy, precision, recall, and F-measure with equal weight on precision and recall. These
measures are defined below. TP (true positives) and FP (false positives) are the numbers of pairs 
that have been correctly or incorrectly, respectively, classified as positive (paraphrases). TN (true 
negatives) and FN (false negatives) are the numbers of pairs that have been correctly or incorrectly, 
respectively, classified as negative (not paraphrases). 

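\[
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN},
\]
\[
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]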
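
The four measures can be computed directly from the four counts, as in the sketch below. The counts in the usage example are hypothetical: they describe a classifier that, like BASE1, labels every pair as positive on an imagined test set of 1,000 pairs of which 66.5% are paraphrases.

```python
def evaluate(tp, fp, tn, fn):
    """Accuracy, precision, recall and balanced F-measure from the counts of
    true/false positives and negatives (zero denominators yield 0.0)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical counts: everything is classified as positive.
all_positive = evaluate(tp=665, fp=335, tn=0, fn=0)
```

With these counts, precision and accuracy both equal 66.5%, recall is 100%, and the F-measure is about 79.9%, which shows how a trivial classifier can obtain a high F-measure through perfect recall alone.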

All the systems of Table 1 have better recall than precision, which implies they tend to over-classify 
pairs as paraphrases, possibly because the sentences of each pair have at least some common words 
and refer to the same event. Systems with higher recall tend to have lower precision, and vice versa, 
as one would expect. The high F-measure of BASE1 is largely due to its perfect recall; its precision






method                      accuracy (%)   precision (%)   recall (%)   F-measure (%)
Bensley & Hickl (2008)      74.6           -               -            -
Iftene (2008)               72.1           65.5            93.2         76.9
Siblini & Kosseim (2008)    68.8           -               -            -
Wang & Neumann (2008)       70.6           -               -            -
BASE1                       50.0           50.0            100.0        66.7
BASE2                       54.9           53.6            73.6         62.0



Table 2: Textual entailment recognition results (for two classes) on the RTE-4 corpus. 



is significantly lower, compared to the other systems. BASE2, which uses only string edit distance, 
is a competitive baseline for this corpus. Space does not permit listing published evaluation results 
of all the paraphrase recognition methods that we have discussed. Furthermore, comparing results 
obtained on different datasets is not always meaningful. 

For textual entailment recognition, the most widely used benchmarks are those of the RTE chal- 
lenges. As an example, the RTE-3 corpus contains 1,600 (T,H) pairs (positive or negative). Four 
application scenarios where textual entailment recognition might be useful were considered: infor- 
mation extraction, information retrieval, question answering, and summarization. There are 200 
training and 200 testing pairs for each scenario; Dagan et al. (2009) explain how they were con- 
structed. The RTE-4 corpus was constructed in a similar way, but it contains only test pairs, 250 for 
each of the four scenarios. A further difference is that in RTE-4 the judges classified the pairs in 
three classes: true entailment pairs, false entailment pairs where H contradicts T (Harabagiu, Hickl, 
& Lacatusu, 2006; de Marneffe, Rafferty, & Manning, 2008), and false pairs where reading T does 
not lead to any conclusion about H; a similar pilot task was included in RTE-3 (Voorhees, 2008). 
The pairs of the latter two classes can be merged, if only two classes (true and false) are desirable. 
We also note that RTE-3 included a pilot task requiring systems to justify their answers. Many of 
the participants, however, used technical or mathematical terminology in their explanations, which 
was not always appreciated by the human judges; also, the entailments were often obvious to the 
judges, to the extent that no justification was considered necessary (Voorhees, 2008). Table 2 lists 
the best accuracy results of RTE-4 participants (for two classes only), along with results of the two 
baselines described previously; precision, recall, and F-measure scores are also shown, when avail- 
able. All four measures are defined as in paraphrase recognition, but positives and negatives are now 
textual entailment pairs. 25 Again, space does not permit listing published evaluation results of all 
the textual entailment recognition methods that we have discussed, and comparing results obtained 
on different datasets is not always meaningful. 

It is also possible to evaluate recognition methods indirectly, by measuring their impact on the 
performance of larger natural language processing systems (Section 1.1). For instance, one could 
measure the difference in the performance of a QA system, or the degree to which the redundancy 
of a generated summary is reduced when using paraphrase and/or textual entailment recognizers. 



25 Average precision, borrowed from information retrieval evaluation, has also been used in the RTE challenges. 
Bergmair (2009), however, argues against using it in RTE challenges and proposes alternative measures. 
3. Paraphrase and Textual Entailment Generation 

Unlike recognizers, paraphrase or textual entailment generators are given a single language expres- 
sion (or template) as input, and they are required to produce as many output language expressions 
(or templates) as possible, such that the output expressions are paraphrases or they constitute, along 
with the input, correct textual entailment pairs. Most generators assume that the input is a single 
sentence (or sentence template), and we adopt this assumption in the remainder of this section. 

3.1 Generation Methods Inspired by Statistical Machine Translation 

Many generation methods borrow ideas from statistical machine translation (SMT). 26 Let us first in- 
troduce some central ideas from SMT, for the benefit of readers unfamiliar with them. SMT methods 
rely on very large bilingual or multilingual parallel corpora, for example the proceedings of the Eu- 
ropean parliament, without constructing meaning representations and often, at least until recently, 
without even constructing syntactic representations. 27 Let us assume that we wish to translate a 
sentence $F$, whose words are $f_1, f_2, \ldots, f_{|F|}$ in that order, from a foreign language to our native 
language. Let us also denote by $N$ any candidate translation, whose words are $a_1, a_2, \ldots, a_{|N|}$. The best 
translation, denoted $N^*$, is the $N$ with the maximum probability of being a translation of $F$, i.e.: 

$$N^* = \arg\max_N P(N|F) = \arg\max_N \frac{P(N)\,P(F|N)}{P(F)} = \arg\max_N P(N)\,P(F|N) \qquad (44)$$

Since $F$ is fixed, the denominator $P(F)$ above is constant and can be ignored when searching for 
$N^*$. $P(N)$ is called the language model and $P(F|N)$ the translation model. 

For modeling purposes, it is common to assume that F was in fact originally written in our native 
language and it was transmitted to us via a noisy channel, which introduced various deformations. 
The possible deformations may include, for example, replacing a native word with one or more 
foreign ones, removing or inserting words, moving words to the left or right etc. The commonly 
used IBM models 1 to 5 (Brown et al., 1993) provide an increasingly richer inventory of word 
deformations; more recent phrase-based SMT systems (Koehn, Och, & Marcu, 2003) also allow 
directly replacing entire phrases with other phrases. The foreign sentence $F$ can thus be seen as the 
result of applying a sequence of transformations $D = (d_1, d_2, \ldots, d_{|D|})$ to $N$, and it is common to 
search for the $N^*$ that maximizes (45); this search is called decoding. 

$$N^* = \arg\max_N P(N) \max_D P(F,D|N) \qquad (45)$$



An exhaustive search is usually intractable. Hence, heuristic search algorithms (e.g., based on beam 
search) are usually employed (Germann, Jahr, Knight, Marcu, & Yamada, 2001; Koehn, 2004). 28 

Assuming for simplicity that the individual deformations of $D$ are mutually independent, 
$P(F,D|N)$ can be computed as the product of the probabilities of $D$'s individual deformations. Given 
a bilingual parallel corpus with words aligned across languages, we can estimate the probabilities 



26 For an introduction to SMT, see chapter 25 of the book "Speech and Language Processing" (Jurafsky & Martin, 
2008), and chapter 13 of the book "Foundations of Statistical Natural Language Processing" (Manning & Schuetze, 
1999). For a more extensive discussion, consult the work of Koehn (2009). 

27 See Koehn's Statistical Machine Translation site for commonly used SMT corpora and tools. 

28 A frequently used SMT system that includes decoding facilities is Moses. 
of all possible deformations. In practice, however, parallel corpora do not indicate word alignment. 
Hence, it is common to find the most probable word alignment of the corpus given initial 
estimates of individual deformation probabilities, then re-estimate the deformation probabilities 
given the resulting alignment, and iterate (Brown et al., 1993; Och & Ney, 2003). 29 
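
The iterative re-estimation can be illustrated with a small sketch in the style of IBM model 1. Note the assumptions: instead of committing to the single most probable alignment in each round, the sketch uses the common Expectation Maximization variant that collects fractional (expected) counts over all possible word links; it models only word replacement (no insertions, deletions, or movements); and the toy corpus and the name `train_ibm1` are illustrative.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """pairs: list of (foreign_words, native_words).
    Returns t with t[(f, a)] approximating P(f | a)."""
    foreign_vocab = {f for fs, _ in pairs for f in fs}
    # Initial estimates: every foreign word is equally likely for every native word.
    t = defaultdict(lambda: 1.0 / len(foreign_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (f, a) links
        total = defaultdict(float)  # expected number of links involving a
        for fs, ns in pairs:
            for f in fs:
                norm = sum(t[(f, a)] for a in ns)
                for a in ns:
                    frac = t[(f, a)] / norm  # share of f explained by a
                    count[(f, a)] += frac
                    total[a] += frac
        # Re-estimate the word replacement probabilities from the expected counts.
        for (f, a), c in count.items():
            t[(f, a)] = c / total[a]
    return t


toy_corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t = train_ibm1(toy_corpus)
```

Although no alignments are given, the co-occurrence pattern across the three sentence pairs is enough for the estimates to favour linking "haus" with "house" and "buch" with "book".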

The translation model $P(F,D|N)$ estimates the probability of obtaining $F$ from $N$ via $D$; we are 
interested in $N$s with high probabilities of leading to $F$. We also want, however, $N$ to be grammatical, 
and we use the language model $P(N)$ to check for grammaticality. $P(N)$ is the probability 
of encountering $N$ in our native language; it is estimated from a large monolingual corpus of our 
language, typically assuming that the probability of encountering word $a_i$ depends only on the 
preceding $n-1$ words. For $n = 3$, $P(N)$ becomes: 

$$P(N) = P(a_1) \cdot P(a_2|a_1) \cdot P(a_3|a_1,a_2) \cdot P(a_4|a_2,a_3) \cdots P(a_{|N|}|a_{|N|-2},a_{|N|-1}) \qquad (46)$$

A language model typically also includes smoothing mechanisms, to cope with $n$-grams that are 
very rare or not present in the monolingual corpus, which would lead to $P(N) = 0$. 30 
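
A trigram language model of this kind can be sketched as follows. The sketch makes two simplifying assumptions that are not taken from the text: sentences are padded with start and end symbols (so the first two factors of (46) are also conditioned on a two-symbol history), and add-one smoothing is used as the simplest possible smoothing mechanism, whereas practical toolkits offer more refined schemes.

```python
import math
from collections import Counter


def train_trigram_lm(sentences):
    """Count trigrams and their two-word histories over padded sentences."""
    trigrams, histories, vocab = Counter(), Counter(), set()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            trigrams[tuple(padded[i - 2:i + 1])] += 1
            histories[tuple(padded[i - 2:i])] += 1
    return trigrams, histories, len(vocab)


def log_prob(words, model):
    """Log of the add-one smoothed trigram probability of a sentence."""
    trigrams, histories, vocab_size = model
    padded = ["<s>", "<s>"] + words + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        numerator = trigrams[tuple(padded[i - 2:i + 1])] + 1
        denominator = histories[tuple(padded[i - 2:i])] + vocab_size
        total += math.log(numerator / denominator)
    return total


model = train_trigram_lm([
    "the bridge was built".split(),
    "the bridge was constructed".split(),
])
```

With smoothing, an unseen word order receives a low but non-zero probability, so the model can still rank candidates rather than rejecting them outright.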

In principle, an SMT system could be used to generate paraphrases, if it could be trained on a 
sufficiently large monolingual corpus of parallel texts. Both N and F are now sentences of the same 
language, but N has to be different from the given F, and it has to convey the same (or almost the 
same) information. The main problem is that there are no readily available monolingual parallel 
corpora of the sizes that are used in SMT, to train the translation model on them. One possibility 
is to use multiple translations of the same source texts; for example, different English translations 
of novels originally written in other languages (Barzilay & McKeown, 2001), or multiple English 
translations of Chinese news articles, as in the Multiple-Translation Chinese Corpus. Corpora of 
this kind, however, are still orders of magnitude smaller than those used in SMT. 

To bypass the lack of large monolingual parallel corpora, Quirk et al. (2004) use clusters of 
news articles referring to the same event. The articles of each cluster do not always report the same 
information and, hence, they are not parallel texts. Since they talk about the same event, however, 
they often contain phrases, sentences, or even longer fragments with very similar meanings; corpora 
of this kind are often called comparable. From each cluster, Quirk et al. select pairs of similar 
sentences (e.g., with small word edit distance, but not identical sentences) using methods like those 
employed to create the MSR corpus (Section 2.8). 31 The sentence pairs are then word aligned as 
in machine translation, and the resulting alignments are used to create a table of phrase pairs as 
in phrase-based SMT systems (Koehn et al., 2003). A phrase pair $(P_1, P_2)$ consists of contiguous 
words (taken to be a phrase, though not necessarily a syntactic constituent) $P_1$ of one sentence that 
are aligned to different contiguous words $P_2$ of another sentence. Quirk et al. provide the following 
examples of discovered pairs. 



29 GIZA++ is often used to train IBM models and align words. 

30 See chapter 4 of the book "Speech and Language Processing" (Jurafsky & Martin, 2008) and chapter 6 of the book 
"Foundations of Statistical Natural Language Processing" (Manning & Schuetze, 1999) for an introduction to language 
models. SRILM (Stolcke, 2002) is a commonly used tool to create language models. 

31 Wubben et al. (2009) discuss similar methods to pair news titles. Barzilay & Elhadad (2003) and Nelken & Shieber 
(2006) discuss more general methods to align sentences of monolingual comparable corpora. Sentence alignment methods 
for bilingual parallel or comparable corpora are discussed, for example, by Gale and Church (1993), Melamed (1999), 
Fung and Cheung (2004), Munteanu and Marcu (2006); see also the work of Wu (2000). Sentence alignment methods 
for parallel corpora may perform poorly on comparable corpora (Nelken & Shieber, 2006). 
$P_1$                  $P_2$
injured                wounded
Bush administration    White House
margin of error        error margin

Phrase pairs that occur frequently in the aligned sentences may be assigned higher probabilities; 
Quirk et al. use probabilities returned by IBM model 1. Their decoder first constructs a lattice 
that represents all the possible paraphrases of the input sentence that can be produced by replacing 
phrases by their counterparts in the phrase table; i.e., the possible deformations $d_i(\cdot)$ are the phrase 
replacements licensed by the phrase table. 32 Unlike machine translation, not all of the words or 
phrases need to be replaced, which is why Quirk et al. also allow a degenerate identity deformation 
$d_{id}(\xi) = \xi$; assigning a high probability to the identity deformation leads to more conservative 
paraphrases, with fewer phrase replacements. The decoder uses the probabilities of $d_i(\cdot)$ to compute 
$P(F,D|N)$ in equation (45), and the language model to compute $P(N)$. The best scored $N^*$ is 
returned as a paraphrase of $F$; the $n$ most highly scored $N$s could also be returned. More generally, 
the table of phrase pairs may also include synonyms obtained from WordNet or similar resources, 
or pairs of paraphrases (or templates) discovered by paraphrase extraction methods; in effect, Quirk 
et al.'s construction of a monolingual phrase table is a paraphrase extraction method. A language 
model may also be applied locally to the replacement words of a deformation and their context to 
assess whether or not the new words fit the original context (Mirkin et al., 2009b). 
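
The following sketch enumerates the paraphrases licensed by a small phrase table, in the spirit of the lattice described above. It is an illustration under stated assumptions, not Quirk et al.'s decoder: the phrase table entries and their probabilities are invented, the search is exhaustive rather than heuristic, a single `identity_prob` is charged for every word left unchanged, and the language model term $P(N)$ is omitted.

```python
def paraphrases(words, table, identity_prob=0.5, max_len=3):
    """Yield (paraphrase_words, score) for every combination of phrase
    replacements from `table` and identity deformations."""
    if not words:
        yield [], 1.0
        return
    # Identity deformation: keep the first word unchanged.
    for rest, score in paraphrases(words[1:], table, identity_prob, max_len):
        yield [words[0]] + rest, identity_prob * score
    # Replace a phrase of up to max_len words starting at the first word.
    for n in range(1, min(max_len, len(words)) + 1):
        phrase = " ".join(words[:n])
        for replacement, prob in table.get(phrase, []):
            for rest, score in paraphrases(words[n:], table, identity_prob, max_len):
                yield replacement.split() + rest, prob * score


# Hypothetical phrase table with made-up probabilities.
phrase_table = {
    "injured": [("wounded", 0.6)],
    "margin of error": [("error margin", 0.7)],
}
source = "two people were injured".split()
candidates = [c for c in paraphrases(source, phrase_table) if c[0] != source]
best = max(candidates, key=lambda c: c[1])
```

Raising `identity_prob` makes the unchanged sentence score higher relative to its paraphrases, which corresponds to the more conservative behaviour mentioned above.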

Zhao et al. (2008, 2009) demonstrated that combining phrase tables derived from multiple re- 
sources improves paraphrase generation. They also proposed scoring the candidate paraphrases 
by using an additional, application-dependent model, called the usability model; for example, in 
sentence compression (Section 1.1) the usability model rewards $N$s that have fewer words than $F$. 
Equation (45) then becomes (47), where $U(F,N)$ is the usability model and $\lambda_i$ are weights assigned 
to the three models; similar weights can be used in (45). 

$$N^* = \arg\max_N U(F,N)^{\lambda_1} \, P(N)^{\lambda_2} \max_D P(F,D|N)^{\lambda_3} \qquad (47)$$

Zhao et al. actually use a log-linear formulation of (47); and they select the weights $\lambda_i$ that maximize 
an objective function that rewards many and correct (as judged by human evaluators) phrasal 
replacements. 33 One may replace the translation model by a paraphrase recognizer (Section 2) that 
returns a confidence score; in its log-linear formulation, (47) then becomes (48), where $R(F,N)$ is 
the confidence score of the recognizer. 

$$N^* = \arg\max_N \left[ \lambda_1 \log U(F,N) + \lambda_2 \log P(N) + \lambda_3 \log R(F,N) \right] \qquad (48)$$
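
To make the role of the weights concrete, the sketch below scores candidates with the log-linear combination of (48) and shows that changing the weights can change the selected candidate. The candidate names and the three scores attached to each (usability, language model probability, recognizer confidence) are invented for illustration.

```python
import math


def log_linear_score(u, p, r, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of log usability, log language model and log recognizer scores."""
    l1, l2, l3 = weights
    return l1 * math.log(u) + l2 * math.log(p) + l3 * math.log(r)


def best_candidate(candidates, weights):
    """candidates maps each N to a tuple (U(F,N), P(N), R(F,N))."""
    return max(candidates,
               key=lambda n: log_linear_score(*candidates[n], weights=weights))


# Hypothetical scores for two candidate paraphrases of the same input.
candidates = {
    "short": (0.9, 0.01, 0.5),   # very usable (e.g., short), but less fluent
    "fluent": (0.5, 0.05, 0.6),  # more fluent, less usable
}
balanced = best_candidate(candidates, (1.0, 1.0, 1.0))
compression_oriented = best_candidate(candidates, (10.0, 1.0, 1.0))
```

With equal weights the more fluent candidate wins, whereas a large weight on the usability model selects the shorter one; exponentiating the score recovers the product form of (47).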

Including hyponyms-hypernyms or textual entailment rules (Section 2.7) in the phrase table 
would generate sentences N that textually entail or are entailed (depending on the direction of the 



32 Chevelu et al. (2009) discuss how other decoders could be developed especially for paraphrase generation. 
33 In a "reluctant paraphrasing" setting (Dras, 1998), for example when revising a document to satisfy length require- 
ments, readability measures, or other externally imposed constraints, it may be desirable to use an objective function that 
rewards making as few changes as possible, provided that the constraints are satisfied. Dras (1998) discusses a formulation 
of this problem in terms of integer programming. 
rules and whether we replace hyponyms by hypernyms or the reverse) by F. SMT-inspired methods, 
however, have been used mostly in paraphrase generation, not in textual entailment generation. 

Paraphrases can also be generated by using pairs of machine translation systems to translate the 
input expression to a new language, often called a pivot language, and then back to the original 
language. The resulting expression is often different from the input one, especially when the two 
translation systems employ different methods. By using different pairs of machine translation sys- 
tems or different pivot languages, multiple paraphrases may be obtained. Duboue and Chu-Carroll 
(2006) demonstrated the benefit of using this approach to paraphrase questions, with an additional 
machine learning classifier to filter the generated paraphrases; their classifier uses features such as 
the cosine similarity between a candidate generated paraphrase and the original question, the lengths 
of the candidate paraphrase and the original question, features showing whether or not both ques- 
tions are of the same type (e.g., both asking for a person name), etc. An advantage of this approach 
is that the machine translation systems can be treated as black boxes, and they can be trained on 
readily available parallel corpora of different languages. A disadvantage is that translation errors 
from both directions may lead to poor paraphrases. We return to pivot languages in Section 4. 
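
The pivot approach can be sketched with the translation systems treated as black boxes, here represented by arbitrary callables. The word-by-word dictionaries below are toy stand-ins for real translation systems and are assumptions made only so that the example is self-contained.

```python
def pivot_paraphrases(sentence, systems):
    """systems: list of (to_pivot, from_pivot) callables, each pair being one
    round trip through some pivot language. Returns the distinct outputs that
    differ from the input."""
    outputs = []
    for to_pivot, from_pivot in systems:
        candidate = from_pivot(to_pivot(sentence))
        if candidate != sentence and candidate not in outputs:
            outputs.append(candidate)
    return outputs


def word_by_word(dictionary):
    """Build a toy 'translation system' from a word-to-word dictionary."""
    return lambda text: " ".join(dictionary.get(w, w) for w in text.split())


# Hypothetical English-French and French-English toy systems.
en_fr = word_by_word({"the": "le", "film": "film", "begins": "commence"})
fr_en = word_by_word({"le": "the", "film": "movie", "commence": "starts"})

result = pivot_paraphrases("the film begins", [(en_fr, fr_en)])
```

Adding further `(to_pivot, from_pivot)` pairs, for other pivot languages or other systems, yields additional candidates, which could then be filtered by a classifier as in the work of Duboue and Chu-Carroll.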

In principle, the output of a generator may be produced by mapping the input to a representation 
of its meaning, a process that usually presupposes parsing, and by passing on the meaning represen- 
tation, or new meaning representations that are logically entailed by the original one, to a natural 
language generation system (Reiter & Dale, 2000; Bateman & Zock, 2003) to produce paraphrases 
or entailed language expressions. This approach would be similar to using language-independent 
meaning representations (an "interlingua") in machine translation, but here the meaning representa- 
tions would not need to be language-independent, since only one language is involved. An approach 
similar to syntactic transfer in machine translation may also be adopted (McKeown, 1983). In that 
case, the input language expression (assumed to be a sentence) is first parsed. The resulting syntac- 
tic representation is then modified in ways that preserve, or affect only slightly, the original meaning 
(e.g., turning a sentence from active to passive), or in ways that produce syntactic representations 
of entailed language expressions (e.g., pruning certain modifiers or subordinate clauses). New lan- 
guage expressions are then generated from the new syntactic representations, possibly by invoking 
the surface realization components of a natural language generation system. Parsing the input 
expression, however, may introduce errors, and producing a correct meaning representation of the input, 
when this is required, may be far from trivial. Furthermore, the natural language generator may be 
capable of producing language expressions of only a limited variety, missing possible paraphrases 
or entailed language expressions. This is perhaps why meaning representation and syntactic transfer 
do not seem to be currently popular in paraphrase and textual entailment generation. 

3.2 Generation Methods that Use Bootstrapping 

When the input and output expressions are slotted templates, it is possible to apply bootstrapping 
to a large monolingual corpus (e.g., the entire Web), instead of using machine translation methods. 
Let us assume, for example, that we wish to generate paraphrases of (49), and that we are given a 
few pairs of seed values of X and Y, as in (50) and (51). 

(49) X is the author of Y. 

(50) (X = "Jack Kerouac", Y = "On the Road") 

(51) (X = "Jules Verne", Y = "The Mysterious Island") 

We can retrieve from the corpus sentences that contain any of the seed pairs: 
(52) Jack Kerouac wrote "On the Road". 

(53) "The Mysterious Island" was written by Jules Verne. 

(54) Jack Kerouac is most known for his novel "On the Road". 

By replacing the known seeds with the corresponding slot names, we obtain new templates: 

(55) X wrote Y. 

(56) Y was written by X. 

(57) X is most known for his novel Y. 

In our example, (55) and (56) are paraphrases of (49); however, (57) textually entails (49), but is 
not a paraphrase of (49). If we want to generate paraphrases, we must keep (55) and (56) only; if we 
want to generate templates that entail (49), we must keep (57) too. Some of the generated candidate 
templates may neither be paraphrases of, nor entail (or be entailed by) the original template. A good 
paraphrase or textual entailment recognizer (Section 2) or a human in the loop would be able to filter 
out bad candidate templates; see also Duclaye et al.'s (2003) work, where Expectation Maximization 
(Mitchell, 1997) is used to filter the candidate templates. Simpler filtering techniques may also be 
used. For example, Ravichandran et al. (2002, 2003) assign to each candidate template a pseudo- 
precision score; roughly speaking, the score is computed as the number of retrieved sentences that 
match the candidate template with X and Y having the values of any seed pair, divided by the 
number of retrieved sentences that match the template when X has a seed value and Y any value, 
not necessarily the corresponding seed value. 
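
The pseudo-precision score just described can be sketched as follows. The sketch assumes that templates are plain strings with the slot names X and Y as separate tokens, that slot values are whatever text a permissive regular expression captures, and that titles appear without quotation marks; these are simplifications for illustration, and the example sentences are invented.

```python
import re


def template_to_regex(template):
    """Turn a slotted template such as 'X wrote Y' into a regex with named groups."""
    parts = []
    for token in template.split():
        if token in ("X", "Y"):
            parts.append(f"(?P<{token}>.+?)")
        else:
            parts.append(re.escape(token))
    return re.compile(r"^" + r"\s+".join(parts) + r"\.?$")


def pseudo_precision(template, sentences, seeds):
    """Matches where (X, Y) is a seed pair, divided by matches where X is a seed value."""
    regex = template_to_regex(template)
    seed_xs = {x for x, _ in seeds}
    both, x_only = 0, 0
    for sentence in sentences:
        match = regex.match(sentence)
        if not match or match.group("X") not in seed_xs:
            continue
        x_only += 1
        if (match.group("X"), match.group("Y")) in seeds:
            both += 1
    return both / x_only if x_only else 0.0


seeds = {("Jack Kerouac", "On the Road"), ("Jules Verne", "The Mysterious Island")}
retrieved = [
    "Jack Kerouac wrote On the Road.",
    "Jack Kerouac wrote Big Sur.",
    "Jules Verne wrote The Mysterious Island.",
    "Mary Shelley wrote Frankenstein.",
]
score = pseudo_precision("X wrote Y", retrieved, seeds)
```

The second sentence lowers the score even though it states a true fact, because its Y value is not the seed value paired with that X; this is why the measure is only a pseudo-precision.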

Having obtained new templates, we can search the corpus for new sentences that match them; 
for example, sentence (58) matches the generated template (56). From the new sentences, more seed 
values can be obtained, if the slot values correspond to types of expressions (e.g., person names) that 
can be recognized reasonably well, for example by using a named entity recognizer or a gazetteer 
(e.g., a large list of book titles); from (58) we would obtain the new seed pair (59). More iterations 
may be used to generate more templates and more seeds, until no more templates and seeds can be 
discovered or a maximum number of iterations is reached. 

(58) Frankenstein was written by Mary Shelley. 

(59) (X = "Mary Shelley", Y = "Frankenstein") 

Figure 4 illustrates how a bootstrapping paraphrase generator works. Templates that textually entail 
or that are textually entailed by an initial template, for which seed slot values are provided, can be 
generated similarly, if the paraphrase recognizer is replaced by a textual entailment recognizer. 
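
The bootstrapping loop can be sketched as below. The assumptions are substantial and should be kept in mind: the corpus is a short list of invented sentences without quotation marks, slot values are captured by a permissive regular expression rather than by a named entity recognizer or gazetteer, and the `keep` function is a placeholder for the recognizer (or human) that filters candidate templates.

```python
import re


def template_to_regex(template):
    """Regex with named groups for the slot tokens X and Y of a template."""
    parts = [f"(?P<{tok}>.+?)" if tok in ("X", "Y") else re.escape(tok)
             for tok in template.split()]
    return re.compile(r"^" + r"\s+".join(parts) + r"\.?$")


def extract_templates(sentences, seeds):
    """Replace seed values found in sentences with the slot names."""
    templates = set()
    for sentence in sentences:
        for x, y in seeds:
            if x in sentence and y in sentence:
                templates.add(sentence.replace(x, "X").replace(y, "Y").rstrip("."))
    return templates


def extract_seeds(sentences, templates):
    """Collect the slot values of all sentences matching any template."""
    seeds = set()
    for template in templates:
        regex = template_to_regex(template)
        for sentence in sentences:
            match = regex.match(sentence)
            if match:
                seeds.add((match.group("X"), match.group("Y")))
    return seeds


def bootstrap(sentences, seeds, keep=lambda template: True, max_iterations=5):
    """Alternate between finding templates and seeds until nothing new appears."""
    seeds, templates = set(seeds), set()
    for _ in range(max_iterations):
        found = {t for t in extract_templates(sentences, seeds) if keep(t)}
        new_templates = found - templates
        templates |= new_templates
        new_seeds = extract_seeds(sentences, templates) - seeds
        seeds |= new_seeds
        if not new_templates and not new_seeds:
            break
    return templates, seeds


corpus = [
    "Jack Kerouac wrote On the Road.",
    "The Mysterious Island was written by Jules Verne.",
    "Frankenstein was written by Mary Shelley.",
    "Mary Shelley is the author of Frankenstein.",
]
initial = {("Jack Kerouac", "On the Road"), ("Jules Verne", "The Mysterious Island")}
templates, seeds = bootstrap(corpus, initial)
```

In this toy run the template "Y was written by X" yields the new seed pair for Mary Shelley, which in turn yields the template "X is the author of Y" in the next iteration.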

If slot values can be recognized reliably, we can also obtain the initial seed slot values auto- 
matically by retrieving directly sentences that match the original templates and by identifying the 
slot values in the retrieved sentences. 34 If we are also given a mechanism to identify sentences of 
interest in the corpus (e.g., sentences involving particular terms, such as names of known diseases 
and medicines), we can also obtain the initial templates automatically, by identifying sentences of 
interest, identifying slot values (e.g., named entities of particular categories) in the sentences, and 
using the contexts of the slot values as initial templates. In effect, the generation task then becomes 
an extraction one, since we are given a corpus, but neither initial templates nor seed slot values. 
TEASE (Szpektor, Tanev, Dagan, & Coppola, 2004) is a well-known bootstrapping method of this 

34 Seed slot values per semantic relation can also be obtained from databases (Mintz, Bills, Snow, & Jurafsky, 2009). 
kind, which produces textual entailment pairs, for example pairs like (60)-(61), given only a monolingual 
(non-parallel) corpus and a dictionary of terms. (60) textually entails (61), for example in 
contexts like those of (62)-(63), but not the reverse. 35 

(60) X prevents Y 

(61) X reduces Y risk 

(62) Aspirin prevents heart attack. 

(63) Aspirin reduces heart attack risk. 

TEASE does not specify the directionality of the produced template pairs, for example whether (60) 
textually entails (61) or vice versa, but additional mechanisms have been proposed that attempt to 
guess the directionality; we discuss one such mechanism, LEDIR (Bhagat, Pantel, & Hovy, 2007), in 
Section 4.1 below. Although TEASE can also be used as a generator, if particular input templates are 
provided, we discuss it further in Section 4.2, along with other bootstrapping extraction methods, 
since in its full form it requires no initial templates (nor seed slot values). The reader is reminded 
that the boundaries between recognizers, generators, and extractors are not always clear. 

Similar bootstrapping methods have been used to generate information extraction patterns (Riloff 
& Jones, 1999; Xu, Uszkoreit, & Li, 2007). Some of these methods, however, require corpora an- 
notated with instances of particular types of events to be extracted (Huffman, 1995; Riloff, 1996b; 
Soderland, Fisher, Aseltine, & Lehnert, 1995; Soderland, 1999; Muslea, 1999; Califf & Mooney, 
2003), or texts that mention the target events and near-miss texts that do not (Riloff, 1996a). 

Marton et al. (2009) used a similar approach, but without iterations, to generate paraphrases of 
unknown source language phrases in a phrase-based SMT system (Section 1.1). For each unknown 
phrase, they collected contexts where the phrase occurred in a monolingual corpus of the source 
language, and they searched for other phrases (candidate paraphrases) in the corpus that occurred 
in the same contexts. They subsequently produced feature vectors for both the unknown phrase and 
its candidate paraphrases, with each vector showing how often the corresponding phrase cooccurred 
with other words. The candidate paraphrases were then ranked by the similarity of their vectors 
to the vector of the unknown phrase. The unknown phrases were in effect replaced by their best 
paraphrases that the SMT system knew how to map to target language phrases, and this improved 
the SMT system's performance. 

3.3 Evaluating Generation Methods 

In most generation applications, for example when rephrasing queries to a QA system (Section 1.1), 
it is desirable not only to produce correct outputs (correct paraphrases, or expressions that constitute 
correct textual entailment pairs along with the input), but also to produce as many correct outputs as 
possible. The two goals correspond to high precision and recall, respectively. For a particular input 
$s_i$, the precision $p_i$ and recall $r_i$ of a generator can now be defined as follows (cf. Section 2.8). $TP_i$ 
is the number of correct outputs for input $s_i$, $FP_i$ is the number of wrong outputs for $s_i$, and $FN_i$ is 
the number of outputs for $s_i$ that have incorrectly not been generated (missed). 

$$p_i = \frac{TP_i}{TP_i + FP_i}, \qquad r_i = \frac{TP_i}{TP_i + FN_i}$$

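The precision and recall scores of a method over a set of $n$ inputs $\{s_i\}$ can then be defined using 
micro-averaging or macro-averaging. In their standard forms, macro-averaging averages the per-input 
scores, whereas micro-averaging pools the counts of all the inputs: 

$$\text{macro:} \quad \text{precision} = \frac{1}{n}\sum_{i=1}^{n} p_i, \qquad \text{recall} = \frac{1}{n}\sum_{i=1}^{n} r_i$$

$$\text{micro:} \quad \text{precision} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)}, \qquad \text{recall} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FN_i)}$$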

35 Example from the work of Szpektor et al. (2004). 
Figure 4: Generating paraphrases of "X wrote Y" by bootstrapping. 
In any case, however, recall cannot be computed in generation, because $FN_i$ is unknown; there are 
numerous correct paraphrases of an input $s_i$ that may have been missed, and there are even more (if 
not infinite) language expressions that entail or are entailed by $s_i$. 36 

Instead of reporting recall, it is common to report (along with precision) the average number 
of outputs, sometimes called yield, defined below, where we assume that there are n test inputs. 
A better option is to report the yield at different precision levels, since there is usually a tradeoff 
between the two figures, which is controlled by parameter tuning (e.g., selecting different values of 
thresholds involved in the methods). 

$$\text{yield} = \frac{1}{n}\sum_{i=1}^{n} (TP_i + FP_i)$$
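
As a minimal sketch of how precision and yield might be computed from human judgements, the function below takes, for each test input, the number of outputs judged correct and the number judged wrong. The function name and the example counts are illustrative; both the macro-averaged and the micro-averaged precision are returned, together with the yield.

```python
def evaluate_generator(judgements):
    """judgements: one (TP_i, FP_i) pair per test input s_i.
    Returns (macro_precision, micro_precision, yield)."""
    n = len(judgements)
    # Macro-averaging: average the per-input precision scores.
    per_input = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp in judgements]
    macro_precision = sum(per_input) / n
    # Micro-averaging: pool the counts of all inputs before dividing.
    total_correct = sum(tp for tp, _ in judgements)
    total_outputs = sum(tp + fp for tp, fp in judgements)
    micro_precision = total_correct / total_outputs if total_outputs else 0.0
    # Yield: average number of outputs (correct or not) per input.
    average_outputs = total_outputs / n
    return macro_precision, micro_precision, average_outputs


# Hypothetical judgements for three inputs.
macro_p, micro_p, avg_yield = evaluate_generator([(3, 1), (1, 1), (0, 2)])
```

Reporting the yield at several precision levels then amounts to repeating this computation for different threshold settings of the generator.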

Note that if we use a fixed set of test inputs $\{s_i\}$, if we store the sets $O_i^{ref}$ of all the correct 
outputs that a reference generation method produces for each $s_i$, and if we treat each $O_i^{ref}$ as the set 
of all possible correct outputs that may be generated for $s_i$, then both precision and recall can be 
computed, and without further human effort when a new generation method, say $M$, is evaluated. 
$FN_i$ is then the number of outputs in $O_i^{ref}$ that have not been produced for $s_i$ by $M$; $FP_i$ is the number 
of $M$'s outputs for $s_i$ that are not in $O_i^{ref}$; and $TP_i$ is the number of $M$'s outputs for $s_i$ that are included 
in $O_i^{ref}$. Callison-Burch et al. (2008) propose an evaluation approach of this kind for what we call 
paraphrase generation. They use phrase alignment heuristics (Och & Ney, 2003; Cohn et al., 2008) 



36 Accuracy (Section 2.8) is also impossible to compute in this case; apart from not knowing $FN_i$, the number of 
outputs that have correctly not been generated ($TN_i$) is infinite. 
to obtain aligned phrases (e.g., "resign", "tender his resignation", "leave office voluntarily") from 
manually word-aligned sentences with the same meanings (from the Multiple-Translation Chinese 
Corpus). Roughly speaking, they use as $\{s_i\}$ phrases for which alignments have been found; and 
for each $s_i$, $O_i^{ref}$ contains the phrases $s_i$ was aligned to. Since $O_i^{ref}$, however, contains far fewer 
phrases than the possible correct paraphrases of $s_i$, the resulting precision score is a (possibly very 
pessimistic) lower bound, and the resulting recall scores only measure to what extent $M$ managed 
to discover the (relatively few) paraphrases in $O_i^{ref}$, as pointed out by Callison-Burch et al. 
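
The reference-set evaluation can be sketched with plain set operations. The reference phrases below are the aligned phrases mentioned above; the system outputs, including "step down", are invented for illustration.

```python
def score_against_reference(outputs, reference):
    """Precision and recall of a method's outputs for one input, treating the
    reference set as if it contained all the correct outputs."""
    tp = len(outputs & reference)   # outputs found in the reference set
    fp = len(outputs - reference)   # outputs missing from the reference set
    fn = len(reference - outputs)   # reference phrases that were not produced
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Reference paraphrases of "resign" and hypothetical outputs of a method M.
reference = {"tender his resignation", "leave office voluntarily"}
outputs = {"tender his resignation", "step down"}
precision, recall = score_against_reference(outputs, reference)
```

Here "step down" is arguably a correct paraphrase, but it is counted as wrong because it is absent from the reference set, which illustrates why the resulting precision is only a lower bound.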

To the best of our knowledge, there are no widely adopted benchmark datasets for paraphrase 
and textual entailment generation, unlike recognition, and comparing results obtained on different 
datasets is not always meaningful. The lack of generation benchmarks is probably due to the fact 
that although it is possible to assemble a large collection of input language expressions, it is practi- 
cally impossible to specify in advance all the numerous (if not infinite) correct outputs a generator 
may produce, as already discussed. In principle, one could use a paraphrase or textual entailment 
recognizer to automatically judge if the output of a generator is a paraphrase of, or forms a correct 
entailment pair with the corresponding input expression. Current recognizers, however, are not yet 
accurate enough, and automatic evaluation measures from machine translation (e.g., BLEU, Section 
2.3) cannot be employed, exactly because their weakness is that they cannot detect paraphrases and 
textual entailment. An alternative, more costly solution is to use human judges, which also allows 
evaluating other aspects of the outputs, such as their fluency (Zhao et al., 2009), as in machine trans- 
lation. One can also evaluate the performance of a generator indirectly, by measuring its impact on 
the performance of larger natural language processing systems (Section 1.1). 

4. Paraphrase and Textual Entailment Extraction 

Unlike recognition and generation methods, extraction methods are not given particular input lan- 
guage expressions. They typically process large corpora to extract pairs of language expressions 
(or templates) that constitute paraphrases or textual entailment pairs. The extracted pairs are stored 
to be used subsequently by recognizers and generators or other applications (e.g., as additional en- 
tries of phrase tables in SMT systems). Most extraction methods produce pairs of sentences (or 
sentence templates) or pairs of shorter expressions. Methods to discover synonyms, hypernym- 
hyponym pairs or, more generally, entailment relations between words (Lin, 1998a; Hearst, 1998; 
Moore, 2001; Glickman & Dagan, 2004; Brockett & Dolan, 2005; Hashimoto, Torisawa, Kuroda, 
De Saeger, Murata, & Kazama, 2009; Herbelot, 2009) can be seen as performing paraphrase or 
textual entailment extraction restricted to pairs of single words. 

4.1 Extraction Methods Based on the Distributional Hypothesis 

A possible paraphrase extraction approach is to store all the word $n$-grams that occur in a large 
monolingual corpus (e.g., for n < 5), along with their left and right contexts, and consider as paraphrases 
$n$-grams that occur frequently in similar contexts. For example, each $n$-gram can be represented 
by a vector showing the words that typically precede or follow the $n$-gram, with the values 
in the vector indicating how strongly each word co-occurs with the $n$-gram; for example, pointwise 
mutual information values (Manning & Schuetze, 1999) may be used. Vector similarity measures, 
for example cosine similarity or Lin's measure (1998a), can then be employed to identify $n$-grams 
that occur in similar contexts by comparing their vectors. 37 This approach has been shown to be 
viable with very large monolingual corpora; Pasca and Dienes (2005) used a Web snapshot of ap- 
proximately a billion Web pages; Bhagat and Ravichandran (2008) used 150 GB of news articles 
and reported that results deteriorate rapidly with smaller corpora. Even if only lightweight linguistic 
processing (e.g., POS tagging, without parsing) is performed, processing such large datasets requires 
very significant processing power, although linear computational complexity is possible with appro- 
priate hashing of the context vectors (Bhagat & Ravichandran, 2008). Paraphrasing approaches of 
this kind are based on Harris's Distributional Hypothesis (1964), which states that words in similar 
contexts tend to have similar meanings. The bootstrapping methods of Section 3.2 are based on 
a similar hypothesis that phrases (or templates) occurring in similar contexts (or with similar slot 
values) tend to have similar meanings, a hypothesis that can be seen as an extension of Harris's. 
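
A small-scale sketch of this approach is given below. It assumes a context of one word to the left and one to the right, weights contexts by pointwise mutual information clipped at zero (a common simplification that is not taken from the text), and compares vectors with cosine similarity; the corpus is a handful of invented sentences, far smaller than the corpora that the approach actually requires.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, max_n=2):
    """Map each word n-gram (n <= max_n) to counts of its left/right neighbours."""
    vectors = defaultdict(Counter)
    for words in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = " ".join(words[i:i + n])
                if i > 0:
                    vectors[gram]["L:" + words[i - 1]] += 1
                if i + n < len(words):
                    vectors[gram]["R:" + words[i + n]] += 1
    return vectors


def pmi_weight(vectors):
    """Replace raw co-occurrence counts with positive PMI values."""
    total = sum(sum(v.values()) for v in vectors.values())
    gram_totals = {g: sum(v.values()) for g, v in vectors.items()}
    context_totals = Counter()
    for v in vectors.values():
        context_totals.update(v)
    return {
        g: {c: max(0.0, math.log(k * total / (gram_totals[g] * context_totals[c])))
            for c, k in v.items()}
        for g, v in vectors.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


corpus = [s.split() for s in [
    "he was injured in the crash",
    "he was wounded in the crash",
    "she was injured in the attack",
    "she was wounded in the attack",
    "the crash was loud",
]]
weighted = pmi_weight(context_vectors(corpus))
similar = cosine(weighted["injured"], weighted["wounded"])
dissimilar = cosine(weighted["injured"], weighted["crash"])
```

In this toy corpus "injured" and "wounded" occur in identical contexts and receive a high similarity, whereas "injured" and "crash" share no contexts.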

Lin and Pantel's (2001) well-known extraction method, called DIRT, is also based on the ex- 
tended Distributional Hypothesis, but it operates at the syntax level. DIRT first applies a dependency 
grammar parser to a monolingual corpus. Parsing the corpus is generally time-consuming and, 
hence, smaller corpora have to be used, compared to methods that do not require parsing; Lin and 
Pantel used 1 GB of news texts in their experiments. Dependency paths are then extracted from the 
dependency trees of the corpus. Let us consider, for example, sentences (64) and (67). Their de- 
pendency trees are shown in Figure 5; the similarity between the two sentences is less obvious than 
in Figure 1, because of the different verbs that are now involved. Two of the dependency paths that 
can be extracted from the trees of Figure 5 are shown in (65) and (68). The labels of the edges are 
augmented by the POS-tags of the words they connect (e.g., N:subj:V instead of simply subj). 38 
The first and last words of the extracted paths are replaced by slots, shown as boxed and numbered 
POS-tags. Roughly speaking, the paths of (65) and (68) correspond to the surface templates of (66) 
and (69), respectively, but the paths are actually templates specified at the syntactic level. 

(64) A mathematician found a solution to the problem. 



(65) $\boxed{N_1}$:subj:V ← found → V:obj:N → solution → N:to:$\boxed{N_2}$

(66) $\boxed{N_1}$ found [a] solution to $\boxed{N_2}$

(67) The problem was solved by a young mathematician. 

(68) $\boxed{N_3}$:obj:V ← solved → V:by:$\boxed{N_4}$

(69) $\boxed{N_3}$ was solved by $\boxed{N_4}$.



DIRT imposes restrictions on the paths that can be extracted from the dependency trees; for 
example, they have to start and end with noun slots. Once the paths have been extracted, it looks for 
pairs of paths that occur frequently with the same slot fillers. If (65) and (68) occur frequently with 
the same fillers (e.g., $N_1 = N_4$ = "mathematician", $N_2 = N_3$ = "problem"), they will be included 
as a pair in DIRT's output (with $N_1 = N_4$ and $N_2 = N_3$). A measure based on mutual information 
(Manning & Schuetze, 1999; Lin & Pantel, 2001) is used to detect paths with common fillers. 
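
The comparison of slot fillers can be illustrated with a minimal sketch. It follows the general shape of a mutual-information-based similarity in the spirit of Lin and Pantel (2001), but it is not DIRT itself: paths are represented here as plain strings, slots are assumed to be already normalized to the names X and Y, negative mutual information values are clipped to zero, and the toy triples are invented for illustration.

```python
import math
from collections import Counter


def slot_mutual_information(triples):
    """Map each (path, slot, filler) triple to its mutual information."""
    n_psw = Counter(triples)                         # path, slot, filler
    n_ps = Counter((p, s) for p, s, _ in triples)    # path, slot, any filler
    n_sw = Counter((s, w) for _, s, w in triples)    # any path, slot, filler
    n_s = Counter(s for _, s, _ in triples)          # any path, slot, any filler
    mi = {}
    for (p, s, w), n in n_psw.items():
        value = math.log(n * n_s[s] / (n_ps[(p, s)] * n_sw[(s, w)]))
        mi[(p, s, w)] = max(value, 0.0)  # assumption: negative values are clipped
    return mi


def slot_similarity(mi, path1, path2, slot):
    """Share of the mutual information mass that lies on common fillers."""
    fillers1 = {w: v for (p, s, w), v in mi.items() if p == path1 and s == slot}
    fillers2 = {w: v for (p, s, w), v in mi.items() if p == path2 and s == slot}
    shared = fillers1.keys() & fillers2.keys()
    numerator = sum(fillers1[w] + fillers2[w] for w in shared)
    denominator = sum(fillers1.values()) + sum(fillers2.values())
    return numerator / denominator if denominator else 0.0


def path_similarity(mi, path1, path2):
    """Geometric mean of the similarities of the two slots."""
    sim_x = slot_similarity(mi, path1, path2, "X")
    sim_y = slot_similarity(mi, path1, path2, "Y")
    return math.sqrt(sim_x * sim_y)


# Hypothetical path occurrences; one triple per slot filler seen in the corpus.
FIND, SOLVE, EAT = "X find solution to Y", "X solve Y", "X eat Y"
triples = [
    (FIND, "X", "mathematician"), (FIND, "Y", "problem"),
    (FIND, "X", "engineer"), (FIND, "Y", "puzzle"),
    (SOLVE, "X", "mathematician"), (SOLVE, "Y", "problem"),
    (SOLVE, "X", "engineer"), (SOLVE, "Y", "puzzle"),
    (EAT, "X", "child"), (EAT, "Y", "apple"),
    (EAT, "X", "engineer"), (EAT, "Y", "sandwich"),
]
mi = slot_mutual_information(triples)
```

In this toy corpus the paths for "find a solution to" and "solve" share all their informative fillers, whereas "eat" shares only a filler that carries no information, so the first pair scores much higher than the second.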

Lin and Pantel call the pairs of templates that DIRT produces "inference rules", but there is 
no directionality between the templates of each pair; the intention seems to be to produce pairs of 
near paraphrases. The resulting pairs are actually often textual entailment pairs, not paraphrases, 



Zhitomirsky-Geffet and Dagan (2009) discuss a bootstrapping approach, whereby the vector similarity scores (ini- 
tially computed using pointwise mutual information values in the vectors) are used to improve the values in the vectors; 
the vector similarity scores are then re-computed. 

38 For consistency with previous examples, we show slightly different labels than those used by Lin and Pantel. 

Figure 5: Dependency trees of sentences (64) and (67). 



and the directionality of the entailment is unspecified. Bhagat et al. (2007) developed a method, 
called LEDIR, to classify the template pairs (P1, P2) that DIRT and similar methods produce into three 
classes: (i) paraphrases, (ii) P1 textually entails P2 and not the reverse, or (iii) P2 textually entails 
P1 and not the reverse; with the addition of LEDIR, DIRT becomes a method that extracts separately 
pairs of paraphrase templates and pairs of directional textual entailment templates. Roughly speaking, 
LEDIR examines the semantic categories (e.g., person, location etc.) of the words that fill P1's and 
P2's slots in the corpus; the categories can be obtained by following WordNet's hypernym-hyponym 
hierarchies from the filler words up to a certain level, or by applying clustering to the words of 
the corpus and using the clusters of the filler words as their categories. 40 If P1 occurs with fillers 
from a substantially larger number of categories than P2, then LEDIR assumes P1 has a more general 
meaning than P2 and, hence, P2 textually entails P1; similarly for the reverse direction. If there is no 
substantial difference in the number of categories, P1 and P2 are taken to be paraphrases. Szpektor 
and Dagan (2008) describe a method similar to DIRT that produces textual entailment pairs of unary 
(single slot) templates (e.g., "X takes a nap" ⇒ "X sleeps") using a directional similarity measure 
for unary templates. 
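
The category-counting idea behind the directionality decision can be sketched as follows. This is only an illustration of the reasoning described above, not LEDIR's actual test: the function name, the `ratio` threshold that stands in for "substantially larger", and the example categories are all assumptions.

```python
def classify_direction(categories_p1, categories_p2, ratio=2.0):
    """Classify a template pair from the filler categories seen with each template.

    `ratio` is a hypothetical threshold: one template counts as more general
    when it occurs with at least `ratio` times as many distinct categories.
    """
    n1, n2 = len(set(categories_p1)), len(set(categories_p2))
    if n1 == 0 or n2 == 0:
        raise ValueError("each template needs at least one filler category")
    if n1 >= ratio * n2:
        return "P2 entails P1"  # P1 looks more general than P2
    if n2 >= ratio * n1:
        return "P1 entails P2"  # P2 looks more general than P1
    return "paraphrases"


# Hypothetical category observations for two templates.
general = ["person", "animal", "organization", "location"]
specific = ["person", "person"]
```

Here `classify_direction(general, specific)` treats the first template as the more general one, so the second is taken to entail it; two templates with similar category counts fall through to the paraphrase class.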

Extraction methods based on the (extended) Distributional Hypothesis often produce pairs of 
templates that are not correct paraphrasing or textual entailment pairs, although they share many 
common fillers. In fact, pairs involving antonyms are frequent; according to Lin and Pantel (2001), 
DIRT finds "X solves Y" to be very similar to "X worsens Y"; and the same problem has been 
reported in experiments with LEDIR (Bhagat et al., 2007) and distributional approaches that operate 
at the surface level (Bhagat & Ravichandran, 2008). 

Ibrahim et al.'s (2003) method is similar to DIRT, but it assumes that a monolingual parallel 
corpus is available (e.g., multiple English translations of novels), whereas DIRT does not require 
parallel corpora. Ibrahim et al.'s method extracts pairs of dependency paths only from aligned 
sentences that share matching anchors. Anchors are allowed to be only nouns or pronouns, and they 
match if they are identical, if they are a noun and a compatible pronoun, if they are of the same 
semantic category etc. In (70)-(71), square brackets and subscripts indicate matching anchors. 41 
The pair of templates of (72)-(73) would be extracted from (70)-(71); for simplicity, we show 
sentences and templates as surface strings, although the method operates on dependency trees and 



39 Template pairs produced by DIRT are available on-line. 

40 For an introduction to clustering methods, consult chapter 14 of "Foundations of Statistical Natural Language Processing" (Manning & Schuetze, 1999). 

41 Simplified example from Ibrahim et al.'s work (2003). 

paths. Matching anchors become matched slots. Heuristic functions are used to score the anchor 
matches (e.g., identical anchors are preferred to matching nouns and pronouns) and the resulting 
template pairs; roughly speaking, frequently rediscovered template pairs are rewarded, especially 
when they occur with many different anchors. 

(70) The [clerk]1 liked [Bovary]2. 

(71) [He]1 was fond of [Bovary]2. 

(72) X liked Y. 

(73) X was fond of Y. 
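
How matching anchors become matched slots can be shown with a small surface-level sketch. As noted above, the actual method operates on dependency trees and paths; the token spans, the function names, and the limit of three slots below are simplifying assumptions made purely for illustration.

```python
def to_template(tokens, spans):
    """Replace each anchored span (start, end, slot_name) by its slot name."""
    by_start = {start: (end, name) for start, end, name in spans}
    out, i = [], 0
    while i < len(tokens):
        if i in by_start:
            end, name = by_start[i]
            out.append(name)   # the whole anchor span becomes one slot
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


def extract_template_pair(sent1, sent2, matches):
    """`matches` pairs an anchor span of sent1 with an anchor span of sent2."""
    spans1, spans2 = [], []
    # zip with "XYZ" means this sketch supports at most three matched anchors.
    for ((s1, e1), (s2, e2)), name in zip(matches, "XYZ"):
        spans1.append((s1, e1, name))
        spans2.append((s2, e2, name))
    return to_template(sent1, spans1), to_template(sent2, spans2)


sent1 = ["The", "clerk", "liked", "Bovary"]
sent2 = ["He", "was", "fond", "of", "Bovary"]
# "The clerk" matches "He"; "Bovary" matches "Bovary".
matches = [((0, 2), (0, 1)), ((3, 4), (4, 5))]
pair = extract_template_pair(sent1, sent2, matches)
```

With the aligned sentences (70)-(71) as input, the sketch yields the surface forms of the templates (72)-(73).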

By operating on aligned sentences of monolingual parallel corpora, Ibrahim et al.'s method may 
avoid, to some extent, producing pairs of unrelated templates that simply happen to share common 
slot fillers; the resulting pairs of templates are also more likely to be paraphrases, rather than simply 
textual entailment pairs, since they are obtained from aligned sentences of a monolingual parallel 
corpus. Large monolingual parallel corpora, however, are more difficult to obtain than non-parallel 
corpora, as already discussed. An alternative is to identify anchors in related sentences from com- 
parable corpora (Section 3.1), which are easier to obtain. Shinyama and Sekine (2003) find pairs 
of sentences that share the same anchors within clusters of news articles reporting the same event. 
In their method, anchors are named entities (e.g., person names) identified using a named entity 
recognizer, or pronouns and noun phrases that refer to named entities; heuristics are employed to 
identify likely referents. Dependency trees are then constructed from each pair of sentences, and 
pairs of dependency paths are extracted from the trees by treating anchors as slots. 

4.2 Extraction Methods that Use Bootstrapping 

Bootstrapping approaches can also be used in extraction, as in generation (Section 3.2), but with the 
additional complication that there is no particular input template nor seed values of its slots to start 
from. To address this complication, TEASE (Szpektor et al., 2004) starts with a lexicon of terms of a 
knowledge domain, for example names of diseases, symptoms etc. in the case of a medical domain; 
to some extent, such lexicons can be constructed automatically from a domain-specific corpus (e.g., 
medical articles) via term acquisition techniques (Jacquemin & Bourigault, 2003). TEASE then 
extracts from a (non-parallel) monolingual corpus pairs of textual entailment templates that can be 
used with the lexicon's terms as slot fillers. We have already shown a resulting pair of templates, 
(60)-(61), in Section 3.2; we repeat it as (74)-(75) below. Recall that TEASE does not indicate the 
directionality of the resulting template pairs, for example whether (74) textually entails (75) or vice 
versa, but mechanisms like LEDIR (Section 4.1) could be used to guess the directionality. 

(74) X prevents Y 

(75) X reduces Y risk 

Roughly speaking, TEASE first identifies noun phrases that cooccur frequently with each term of 
the lexicon, excluding very common noun phrases. It then uses the terms and their cooccurring 
noun phrases as seed slot values to obtain templates, and then the new templates to obtain more 
slot values, much as in Figure 4. In TEASE, however, the templates are actually slotted dependency 
paths, and the method includes a stage that merges compatible templates to form more general 
ones. 42 If particular input templates are provided, TEASE can be used as a generator (Section 3.2). 

42 Template pairs produced by TEASE are available on-line. 

Barzilay and McKeown (2001) also used a bootstrapping method, but to extract paraphrases 
from a parallel monolingual corpus; they used multiple English translations of novels. Unlike 
previously discussed bootstrapping approaches, their method involves two classifiers (in effect, two 
sets of rules). One classifier examines the words the candidate paraphrases consist of, and a second 
one examines their contexts. The two classifiers use different feature sets (different views of the 
data), and the output of each classifier is used to improve the performance of the other one in 
an iterative manner; this is a case of co-training (Blum & Mitchell, 1998). More specifically, a 
POS tagger, a shallow parser, and a stemmer are first applied to the corpus, and the sentences are 
aligned across the different translations. Words that occur in both sentences of an aligned pair 
are treated as seed positive lexical examples; all the other pairs of words from the two sentences 
become seed negative lexical examples. From the aligned sentences (76)-(77), we obtain three seed 
positive lexical examples, shown in (78)-(80), and many more seed negative lexical examples, two 
of which are shown in (81)-(82). 43 Although seed positive lexical examples are pairs of identical 
words, as the algorithm iterates new positive lexical examples are produced, and some of them may 
be synonyms (e.g., "comfort" and "console") or pairs of longer paraphrases, as will be explained 
below. 

(76) He tried to comfort her. 

(77) He tried to console Mary. 

(78) ⟨expression1 = "he", expression2 = "he", +⟩ 

(79) ⟨expression1 = "tried", expression2 = "tried", +⟩ 

(80) ⟨expression1 = "to", expression2 = "to", +⟩ 

(81) ⟨expression1 = "he", expression2 = "tried", −⟩ 

(82) ⟨expression1 = "he", expression2 = "to", −⟩ 
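
The construction of seed lexical examples can be sketched in a few lines. The sketch follows the description given above (identical words are positive, all other word pairs are negative); the tokenizer and the function names are assumptions, and the POS tagging, shallow parsing, and stemming steps are omitted.

```python
import re


def tokenize(sentence):
    """Lower-case the sentence and keep alphabetic tokens only (an assumption)."""
    return re.findall(r"[a-z]+", sentence.lower())


def seed_examples(sentence1, sentence2):
    """Return (positives, negatives) as sets of (expression1, expression2) pairs."""
    words1, words2 = tokenize(sentence1), tokenize(sentence2)
    positives = {(a, b) for a in words1 for b in words2 if a == b}
    negatives = {(a, b) for a in words1 for b in words2 if a != b}
    return positives, negatives


positives, negatives = seed_examples("He tried to comfort her.",
                                     "He tried to console Mary.")
```

For (76)-(77) this gives the three positive examples of (78)-(80). Note that the pair ("comfort", "console") starts out among the negatives; it can only be recognized as a positive example later, through the context rules.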

The contexts of the positive (similarly, negative) lexical examples in the corresponding sen- 
tences are then used to construct positive (or negative) context rules, i.e., rules that can be used to 
obtain new pairs of positive (or negative) lexical examples. Barzilay and McKeown (2001) use the 
POS tags of the l words before and after the lexical examples as contexts, and in their experiments 
set l = 3. For simplicity, however, let us assume that l = 2; then, for instance, from (76)-(77) and 
the positive lexical example of (79), we obtain the positive context rule of (83). The rule says that 
if two aligned sentences contain two sequences of words, say X1 and X2, one from each sentence, 
and both X1 and X2 are preceded by the same pronoun, and both are followed by "to" and a (possibly 
different) verb, then X1 and X2 are positive lexical examples. Identical subscripts in the POS 
tags denote identical words; for example, (83) requires both X1 and X2 to be preceded by the same 
pronoun, but the verbs that follow them may be different. 

(83) ⟨left1 = (pronoun_1), right1 = (to_1, verb), left2 = (pronoun_1), right2 = (to_1, verb), +⟩ 

In each iteration, only the k strongest positive and negative context rules are retained. The 
strength of each context rule is its precision, i.e., for positive context rules, the number of positive 
lexical examples whose contexts are matched by the rule divided by the number of both positive and 
negative lexical examples matched, and similarly for negative context rules. Barzilay and McKeown 
(2001) used k = 10, and they also discarded context rules whose strength was below 95%. The 
resulting (positive and negative) context rules are then used to identify new (positive and negative) 

43 Simplified example from the work of Barzilay and McKeown (2001). 

Figure 6: Word lattices obtained from sentence clusters in Barzilay and Lee's method. 

lexical examples. From the aligned (84)-(85), the rule of (83) would figure out that "tried" is a 
synonym of "attempted"; the two words would be treated as a new positive lexical example, shown 
in (86). 

(84) She tried to run away. 

(85) She attempted to escape. 

(86) ⟨expression1 = "tried", expression2 = "attempted", +⟩ 
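
The retention step for context rules described earlier (keeping the k strongest rules, where strength is precision and weak rules are discarded) can be sketched as follows for positive rules. The default values k = 10 and 95% are those reported for Barzilay and McKeown's experiments; the data layout, the function names, and the example counts are hypothetical.

```python
def rule_strength(matched_positive, matched_negative):
    """Precision of a positive context rule over the examples it matches."""
    total = matched_positive + matched_negative
    return matched_positive / total if total else 0.0


def select_rules(match_counts, k=10, min_strength=0.95):
    """Keep the k strongest rules whose strength is at least min_strength.

    `match_counts` maps a rule identifier to a pair
    (matched positive examples, matched negative examples).
    """
    scored = [(rule_strength(pos, neg), rule)
              for rule, (pos, neg) in match_counts.items()]
    kept = [item for item in scored if item[0] >= min_strength]
    kept.sort(key=lambda item: item[0], reverse=True)  # strongest first
    return [rule for _, rule in kept[:k]]


# Hypothetical match counts for three positive context rules.
counts = {"rule_a": (99, 1), "rule_b": (50, 50), "rule_c": (20, 0)}
selected = select_rules(counts)
```

For negative context rules the same computation applies with the roles of the positive and negative counts swapped.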

The context rules may also produce multi-word lexical examples, like the one shown in (87). 
The obtained lexical examples are generalized by replacing their words by their POS tags, giving 
rise to paraphrasing rules. From (87) we obtain the positive paraphrasing rule of (88); again, POS 
subscripts denote identical words, whereas superscripts denote identical stems. The rule of (88) 
says that any sequence of words consisting of a verb, "to", and another verb is a paraphrase of any 
other sequence consisting of the same initial verb followed directly by another verb of the same stem as the 
second verb of the first sequence, provided that the two sequences occur in aligned sentences. 

(87) ⟨expression1 = "start to talk", expression2 = "start talking", +⟩ 

(88) ⟨generalized_expression1 = (verb_0, to, verb^1), generalized_expression2 = (verb_0, verb^1), +⟩ 

The paraphrasing rules are also filtered by their strength, which is the precision with which they 
predict paraphrasing contexts. The remaining paraphrasing rules are used to obtain more lexical 
examples, which are also filtered by the precision with which they predict paraphrasing contexts. 
The new positive and negative lexical examples are then added to the existing ones, and they are 
used to obtain, score, and filter new positive and negative context rules, as well as to rescore and 
filter the existing ones. The resulting context rules are then employed to obtain more lexical exam- 
ples, more paraphrasing rules, and so on, until no new positive lexical examples can be obtained 
from the corpus, or a maximum number of iterations is exceeded. Wang et al. (2009) added more 
scoring measures to Barzilay and McKeown's (2001) method to filter and rank the paraphrase pairs 
it produces, and used the extended method to extract paraphrases of technical terms from clusters 
of bug reports. 

4.3 Extraction Methods Based on Alignment 

Barzilay and Lee (2003) used two corpora of the same genre, but from different sources (news 
articles from two press agencies). They call the two corpora comparable, but they use the term with 
a slightly different meaning than in previously discussed methods; the sentences of each corpus 
were clustered separately, and each cluster was intended to contain sentences (from a single corpus) 
referring to events of the same type (e.g., bomb attacks), not sentences (or documents) referring to 
the same events (e.g., the same particular bombing). From each cluster, a word lattice was produced 
by aligning the cluster's sentences with Multiple Sequence Alignment (Durbin, Eddy, Krogh, & 
Mitchison, 1998; Barzilay & Lee, 2002). The solid lines of Figure 6 illustrate two possible resulting 
lattices, from two different clusters; we omit stop-words. Each sentence of a cluster corresponds to 
a path in the cluster's lattice. In each lattice, nodes that are shared by a high percentage (50% in 
Barzilay and Lee's experiments) of the cluster's sentences are considered backbone nodes. Parts of 
the lattice that connect otherwise consecutive backbone nodes are replaced by slots, as illustrated in 
Figure 6. The two lattices of our example correspond to the surface templates (89)-(90). 

(89) X bombed Y. 

(90) Y was bombed by X. 

The encountered fillers of each slot are also recorded. If two slotted lattices (templates) from dif- 
ferent corpora share many fillers, they are taken to be a pair of paraphrases (Figure 6). Hence, this 
method also uses the extended Distributional Hypothesis (Section 4.1). 

Pang et al.'s method (2003) produces finite state automata very similar to Barzilay and Lee's 
(2003) lattices, but it requires a parallel monolingual corpus; Pang et al. used the Multiple-Translation 
Chinese Corpus (Section 3.1) in their experiments. The parse trees of aligned sentences are con- 
structed and then merged as illustrated in Figure 7; vertical lines inside the nodes indicate sequences 
of necessary constituents, whereas horizontal lines correspond to disjunctions. 44 In the example of 
Figure 7, both sentences consist of a noun phrase (NP) followed by a verb phrase (VP); this is reflected 
in the root node of the merged tree. In both sentences, the noun phrase is a cardinal number 
(CD) followed by a noun (NN); however, the particular cardinal numbers and nouns are different 
across the two sentences, leading to leaf nodes with disjunctions. The rest of the merged tree is 
constructed similarly; consult Pang et al. for further details. Presumably one could also generalize 
over cardinal numbers, types of named entities etc. 

Each merged tree is then converted to a finite state automaton by traversing the tree in a depth- 
first manner and introducing a ramification when a node with a disjunction is encountered. Figure 8 
shows the automaton that corresponds to the merged tree of Figure 7. All the language expressions 
that can be produced by the automaton (all the paths from the start to the end node) are paraphrases. 
Hence, unlike other extraction methods, Pang et al.'s (2003) method produces automata, rather than 
pairs of templates, but the automata can be used in a similar manner. In recognition, for example, if 
two strings are accepted by the same automaton, they are paraphrases; and in generation, we could 
look for an automaton that accepts the input expression, and then output other expressions that can 
be generated by the same automaton. As with Barzilay and Lee's (2003) method, however, Pang et 
al.'s (2003) method is intended to extract mostly paraphrase, not simply textual entailment pairs. 
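
The way such an automaton can be used in recognition and generation can be sketched with a toy acyclic automaton. The automaton below is hypothetical (it is not a transcription of Figure 8), and the dictionary encoding and function names are assumptions.

```python
def paraphrases(automaton, start, end):
    """Enumerate the expressions along all paths of an acyclic automaton.

    `automaton` maps a state to a list of (label, next_state) edges.
    """
    def walk(state):
        if state == end:
            yield []
            return
        for label, next_state in automaton.get(state, []):
            for rest in walk(next_state):
                yield [label] + rest
    return [" ".join(path) for path in walk(start)]


def are_paraphrases(automaton, start, end, expression1, expression2):
    """Recognition: both expressions must be produced by the same automaton."""
    produced = set(paraphrases(automaton, start, end))
    return expression1 in produced and expression2 in produced


# Hypothetical automaton; each disjunction becomes a branch between two states.
toy = {
    0: [("twelve", 1), ("12", 1)],
    1: [("people", 2), ("persons", 2)],
    2: [("died", 3), ("were killed", 3)],
}
generated = paraphrases(toy, 0, 3)
```

Generation then amounts to finding an automaton that produces the input expression and returning the other members of `generated`.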

Bannard and Callison-Burch (2005) point out that bilingual parallel corpora are much easier to 
obtain, and in much larger sizes, than the monolingual parallel or comparable corpora that some 
extraction methods employ. Hence, they set out to extract paraphrases from bilingual parallel cor- 
pora commonly used in statistical machine translation (SMT). As already discussed in Section 3.1, 
phrase-based SMT systems employ tables whose entries show how phrases of one language may be 

44 Example from Pang et al.'s work (2003). 

Figure 7: Merging parse trees of aligned sentences in Pang et al.'s method. 




Figure 8: Finite state automaton produced by Pang et al.'s method. 



replaced by phrases of another language; phrase tables of this kind may be produced by applying 
phrase alignment heuristics (Och & Ney, 2003; Cohn et al., 2008) to word alignments produced 
by the commonly used IBM models. In the case of an English-German parallel corpus, a phrase 
table may contain entries like the following, which show that "under control" has been aligned with 
"unter kontrolle" in the corpus, but "unter kontrolle" has also been aligned with "in check"; hence, 
"under control" and "in check" are a candidate paraphrase pair. 45 



English phrase     German phrase 
under control      unter kontrolle 
in check           unter kontrolle 







More precisely, to paraphrase English phrases, Bannard and Callison-Burch (2005) employ a 
pivot language (German, in the example above) and a bilingual parallel corpus for English and the 
pivot language. They construct a phrase table from the parallel corpus, and from the table they 
estimate the probabilities P(e|f) and P(f|e), where e and f range over all of the English and pivot 
language phrases of the table. For example, P(e|f) may be estimated as the number of entries (rows) 
that contain both e and f, divided by the number of entries that contain f, if there are multiple rows 
for multiple alignments of e and f in the corpus, and similarly for P(f|e). The best paraphrase ê2 of 
each English phrase e1 in the table is then computed by equation (91), where f ranges over all the 
pivot language phrases of the phrase table T. 

\[ \hat{e}_2 = \arg\max_{e_2 \neq e_1} P(e_2 \mid e_1) = \arg\max_{e_2 \neq e_1} \sum_{f \in T} P(f \mid e_1)\, P(e_2 \mid f, e_1) \approx \arg\max_{e_2 \neq e_1} \sum_{f \in T} P(f \mid e_1)\, P(e_2 \mid f) \tag{91} \]

45 Example from the work of Bannard and Callison-Burch (2005). 

Androutsopoulos & Malakasiotis 



Multiple bilingual corpora, for different pivot languages, can be used; (91) becomes (92), where C 
ranges over the corpora, and f now ranges over the pivot language phrases of C's phrase table. 

\[ \hat{e}_2 = \arg\max_{e_2 \neq e_1} \sum_{C} \sum_{f \in T(C)} P(f \mid e_1)\, P(e_2 \mid f) \tag{92} \]
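
The pivot computation of (91)-(92) can be illustrated with a minimal sketch. It estimates the two conditional probabilities by counting rows of a phrase table, as described above, and sums the products over pivot phrases and corpora. The list-of-rows encoding, the function names, and the third table row are assumptions for illustration; real phrase tables are far larger and are produced by alignment heuristics.

```python
from collections import Counter, defaultdict


def conditional_probabilities(phrase_table):
    """Estimate P(f|e) and P(e|f) from rows (english, pivot), one per alignment."""
    pair_count = Counter(phrase_table)
    e_count = Counter(e for e, _ in phrase_table)
    f_count = Counter(f for _, f in phrase_table)
    p_f_given_e = {(f, e): n / e_count[e] for (e, f), n in pair_count.items()}
    p_e_given_f = {(e, f): n / f_count[f] for (e, f), n in pair_count.items()}
    return p_f_given_e, p_e_given_f


def best_paraphrase(e1, phrase_tables):
    """Return the e2 != e1 that maximizes the sum over corpora and pivot phrases
    of P(f|e1) * P(e2|f), or None if e1 has no candidate paraphrase."""
    scores = defaultdict(float)
    for table in phrase_tables:  # one phrase table per bilingual corpus C
        p_f_given_e, p_e_given_f = conditional_probabilities(table)
        for (f, e), p_f in p_f_given_e.items():
            if e != e1:
                continue
            for (e2, f2), p_e2 in p_e_given_f.items():
                if f2 == f and e2 != e1:
                    scores[e2] += p_f * p_e2
    return max(scores, key=scores.get) if scores else None


table = [
    ("under control", "unter kontrolle"),
    ("in check", "unter kontrolle"),
    ("in check", "in schach"),  # hypothetical extra row, not from the source
]
best = best_paraphrase("under control", [table])
```

Passing several tables to `best_paraphrase` corresponds to (92); passing a single table corresponds to the approximation on the right-hand side of (91).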

Bannard and Callison-Burch (2005) also considered adding a language model (Section 3.1) to 
their method to favour paraphrase pairs that can be used interchangeably in sentences; roughly 
speaking, the language model assesses how well one element of a pair can replace the other in sen- 
tences where the latter occurs, by scoring the grammaticality of the sentences after the replacement. 
In subsequent work, Callison-Burch (2008) extended their method to require paraphrases to have the 
same syntactic types, since replacing a phrase with one of a different syntactic type generally leads 
to an ungrammatical sentence. 46 Zhou et al. (2006b) employed a method very similar to Bannard 
and Callison-Burch's to extract paraphrase pairs from a corpus, and used the resulting pairs in SMT 
evaluation, when comparing machine-generated translations against human-authored ones. Riezler 
et al. (2007) adopted a similar pivot approach to obtain paraphrase pairs from bilingual phrase ta- 
bles, and used the resulting pairs as paraphrasing rules to obtain paraphrases of (longer) questions 
submitted to a QA system; they also used a log-linear model (Section 3.1) to rank the resulting 
question paraphrases by combining the probabilities of the invoked paraphrasing rules, a language 
model score of the resulting question paraphrase, and other features. 47 

The pivot language approaches discussed above have been shown to produce millions of para- 
phrase pairs from large bilingual parallel corpora. The paraphrases, however, are typically short 
(e.g., up to four or five words), since longer phrases are rare in phrase tables. The methods can also 
be significantly affected by errors in automatic word and phrase alignment (Bannard & Callison- 
Burch, 2005). To take into consideration word alignment errors, Zhao et al. (2008) use a log-linear 
classifier to score candidate paraphrase pairs that share a common pivot phrase, instead of using 
equations (91) and (92). In effect, the classifier uses the probabilities P(f|e1) and P(e2|f) of (91)- 
(92) as features, but it also uses additional features that assess the quality of the word alignment 
between e1 and f, as well as between f and e2. In subsequent work, Zhao et al. (2009) also consider 
the English phrases e1 and e2 to be paraphrases, when they are aligned to different pivot phrases 
f1 and f2, provided that f1 and f2 are themselves a paraphrase pair in the pivot language. Figure 9 
illustrates the original and extended pivot approaches of Zhao et al. The paraphrase pairs (f1, f2) 
of the pivot language are extracted and scored from a bilingual parallel corpus as in the original approach, 
by reversing the roles of the two languages. The scores of the (f1, f2) pairs, which roughly 
speaking correspond to P(f2|f1), are included as additional features in the classifier that scores the 
resulting English paraphrases, along with scores corresponding to P(f1|e1), P(e2|f2), and features 
that assess the word alignments of the phrases involved. 

Zhao et al.'s (2008, 2009) method also extends Bannard and Callison-Burch's (2005) by pro- 
ducing pairs of slotted templates, whose slots can be filled in by words of particular parts of speech 
(e.g., "Noun1 is considered by Noun2" ≈ "Noun2 considers Noun1"). 48 Hence, Zhao et al.'s patterns 
are more general, but a reliable parser of the language we paraphrase in is required; let us assume 
again that we paraphrase in English. Roughly speaking, the slots are formed by removing subtrees 

46 An implementation of Callison-Burch's (2008) method and paraphrase rules it produced are available on-line. 
47 Riezler et al. (2007) also employ a paraphrasing method based on an SMT system trained on question-answer pairs. 
48 A collection of template pairs produced by Zhao et al.'s method is available on-line. 

Figure 9: Illustration of Zhao et al.'s pivot approaches to paraphrase extraction. 

from the dependency trees of the English sentences and replacing the removed subtrees by the POS 
tags of their roots; words of the pivot language sentences that are aligned to removed words of the 
corresponding English sentences are also replaced by slots. A language model is also used, when 
paraphrases are replaced in longer sentences. Zhao et al.'s experiments show that their method out- 
performs DIRT, and that it is able to output as many paraphrase pairs as the method of Bannard and 
Callison-Burch, but with better precision, i.e., fewer wrongly produced pairs. Most of the generated 
paraphrases (93%), however, contain only one slot, and the method is still very sensitive to word 
alignment errors (Zhao et al., 2009), although the features that check the word alignment quality 
alleviate the problem. 

Madnani et al. (2007) used a pivot approach similar to Bannard and Callison-Burch's (2005) 
to obtain synchronous (normally bilingual) English-to-English context-free grammar rules from 
bilingual parallel corpora. Parsing an English text with the English-to-English synchronous rules 
automatically paraphrases it; hence the resulting synchronous rules can be used in paraphrase gen- 
eration (Section 3). The rules have associated probabilities, which are estimated from the bilingual 
corpora. A log-linear combination of the probabilities and other features of the invoked rules is 
used to guide parsing. Madnani et al. employed the English-to-English rules to parse and, thus, 
paraphrase human-authored English reference translations of Chinese texts. They showed that using 
the additional automatically generated reference translations when tuning a Chinese-to-English 
SMT system improves its performance, compared to using only the human-authored references. 

We note that the alignment-based methods of this section appear to have been used to extract 
only paraphrase pairs, not (unidirectional) textual entailment pairs. 

4.4 Evaluating Extraction Methods 

When evaluating extraction methods, we would ideally measure both their precision (what percent- 
age of the extracted pairs are correct paraphrase or textual entailment pairs) and their recall (what 
percentage of all the correct pairs that could have been extracted have actually been extracted). As 
in generation, however, recall cannot be computed, because the number of all correct pairs that 
could have been extracted from a large corpus (by an ideal method) is unknown. Instead, one may 
again count the number of extracted pairs (the total yield of the method), possibly at different pre- 
cision levels. Different extraction methods, however, produce pairs of different kinds (e.g., surface 
strings, slotted surface templates, or slotted dependency paths) from different kinds of corpora (e.g., 
monolingual or multilingual parallel or comparable corpora); hence, direct comparisons of extrac- 
tion methods may be impossible. Furthermore, different scores are obtained, depending on whether 
the extracted pairs are considered in particular contexts or not, and whether they are required to be 
interchangeable in grammatical sentences (Bannard & Callison-Burch, 2005; Barzilay & Lee, 2003; 
Callison-Burch, 2008; Zhao et al., 2008). The output of an extraction method may also include pairs 
with relatively minor variations (e.g., active vs. passive, verbs vs. nominalizations, or variants such 
as "the X company bought Y" vs. "X bought Y"), which may cause methods that produce large numbers 
of minor variants to appear better than they really are; these points also apply to the evaluation 
of generation methods (Section 3.3), though they have been discussed mostly in the extraction liter- 
ature. Detecting and grouping such variants (e.g., turning all passives and nominalizations to active 
forms) may help avoid this bias and may also improve the quality of the extracted pairs by making 
the occurrences of the (grouped) expressions less sparse (Szpektor & Dagan, 2007). 

As in generation, in principle one could use a paraphrase or textual entailment recognizer to 
automatically score the extracted pairs. However, recognizers are not yet accurate enough; hence, 
human judges are usually employed. When extracting slotted textual entailment rules (e.g., "X 
painted Y" textually entails "Y is the work of X"), Szpektor et al. (2007) report that human judges 
find it easier to agree whether or not particular instantiations of the rules (in particular contexts) 
are correct or incorrect, as opposed to asking them to assess directly the correctness of the rules. 
A better evaluation strategy, then, is to show the judges multiple sentences that match the left- 
hand side of each rule, along with the corresponding transformed sentences that are produced by 
applying the rule, and measure the percentage of these sentence pairs the judges consider correct 
textual entailment pairs; this measure can be thought of as the precision of each individual rule. 
Rules whose precision exceeds a (high) threshold can be considered correct (Szpektor et al., 2007). 
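
The instance-based evaluation strategy can be sketched as a small computation over judge annotations. The threshold value of 0.8, the data layout, and the example judgements are hypothetical; the source only states that a high threshold is used.

```python
def rule_precision(judgements):
    """Fraction of instantiated sentence pairs that the judges marked correct."""
    return sum(judgements) / len(judgements)


def correct_rules(judgements_by_rule, threshold=0.8):
    """Rules whose precision exceeds the (hypothetical) threshold."""
    return sorted(rule for rule, judgements in judgements_by_rule.items()
                  if judgements and rule_precision(judgements) > threshold)


# Hypothetical annotations: True means the judges accepted the sentence pair.
annotations = {
    '"X painted Y" => "Y is the work of X"': [True] * 9 + [False],
    '"X visited Y" => "X lives in Y"': [True, False, False, False],
}
accepted = correct_rules(annotations)
```

Each rule is thus scored only through its instantiations in context, which is the setting in which judges were reported to agree more easily.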

Again, one may also evaluate extraction methods indirectly, for example by measuring how 
much the extracted pairs help in information extraction (Bhagat & Ravichandran, 2008; Szpektor 
& Dagan, 2007, 2008) or when expanding queries (Pasca & Dienes, 2005), by measuring how 
well the extracted pairs, seen as paraphrasing rules, perform in phrase alignment in monolingual 
parallel corpora (Callison-Burch et al., 2008), or by measuring to what extent SMT or summarization 
evaluation measures can be improved by taking into consideration the extracted pairs (Callison-Burch 
et al., 2006a; Kauchak & Barzilay, 2006; Zhou et al., 2006b). 

5. Conclusions 

Paraphrasing and textual entailment is currently a popular research topic. Paraphrasing can be seen 
as bidirectional textual entailment and, hence, similar methods are often used for both. Although 
both kinds of methods can be described in terms of logical entailment, they are usually intended to 
capture human intuitions that may not be as strict as logical entailment; and although logic-based 
methods have been developed, most methods operate at the surface, syntactic, or shallow semantic 
level, with dependency trees being a particularly popular representation. 

Recognition methods, which classify input pairs of natural language expressions (or templates) 
as correct or incorrect paraphrases or textual entailment pairs, often rely on supervised machine 
learning to combine similarity measures possibly operating at different representation levels (sur- 
face, syntactic, semantic). More recently, approaches that search for sequences of transformations 
that connect the two input expressions are also gaining popularity, and they exploit paraphrasing 
or textual entailment rules extracted from large corpora. The RTE challenges provide a significant 
thrust to recognition work, and they have helped establish benchmarks and attract more researchers. 

Generation methods, meaning methods that generate paraphrases of an input natural language 
expression (or template), or expressions that entail or are entailed by the input expression, are currently 
based mostly on bootstrapping or ideas from statistical machine translation. There are fewer 
publications on generation, compared to recognition (and extraction), and most of them focus on 
paraphrasing; furthermore, there are no established challenges or benchmarks, unlike recognition. 
Nevertheless, generation may provide opportunities for novel research, especially to researchers 
with experience in statistical machine translation, who may for example wish to develop alignment 
or decoding techniques especially for paraphrasing or textual entailment generation. 

Main ideas discussed                                     R-TE  R-P   G-TE  G-P   E-TE  E-P
Logic-based inferencing                                  X     X     -     -     -     -
Vector space semantic models                             -     X     -     -     -     -
Surface string similarity measures                       X     X     -     -     -     -
Syntactic similarity measures                            X     X     -     -     -     -
Similarity measures on symbolic meaning representations  X     X     -     -     -     -
Machine learning algorithms                              X     X     -     X     -     X
Decoding (transformation sequences)                      X     X     -     X     -     -
Word/sentence alignment                                  X     -     -     X     -     X
Pivot language(s)                                        -     -     -     X     -     X
Bootstrapping                                            -     -     X     X     X     X
Distributional hypothesis                                -     -     X     X     X     X
Synchronous grammar rules                                -     -     -     X     -     X

Table 3: Main ideas discussed and tasks they have mostly been used in. R: recognition; G: generation; 
E: extraction; TE: textual entailment; P: paraphrasing. A dash marks an empty cell. 

Extraction methods extract paraphrases or textual entailment pairs (also called "rules") from 
corpora, usually off-line. They can be used to construct resources (e.g., phrase tables or collec- 
tions of rules) that can be exploited by recognition or generation methods, or in other tasks (e.g., 
statistical machine translation, information extraction). Many extraction methods are based on the 
Distributional Hypothesis, though they often operate at different representation levels. Alignment 
techniques originating from statistical machine translation have recently also become popular, and they allow 
existing large bilingual parallel corpora to be exploited. Extraction methods also differ depending 
on whether they require parallel, comparable, or simply large corpora, monolingual or bilingual. As 
in generation, most extraction research has focused on paraphrasing, and there are no established 
challenges or benchmarks. 

Table 3 summarizes the main ideas we have discussed per task, and Table 4 lists the correspond- 
ing main resources that are typically required. The underlying ideas of generation and extraction 
methods are in effect the same, as shown in Table 3, even if the methods perform different tasks; 
recognition work has relied on rather different ideas. Generation and extraction have mostly focused 
on paraphrasing, as already noted, which is why fewer ideas have been explored in generation and 
extraction for (unidirectional) textual entailment. 

We expect to see more interplay among recognition, generation, and extraction methods in the 
near future. For example, recognizers and generators may use extracted rules to a larger extent; 
recognizers may be used to filter candidate paraphrases or textual entailment pairs in extraction 
or generation approaches; and generators may help produce more monolingual parallel corpora 
or recognition benchmarks. We also expect to see paraphrasing and textual entailment methods 
being used more often in larger natural language processing tasks, including question answering, 
information extraction, text summarization, natural language generation, and machine translation. 

Main ideas discussed                  Main typically required resources
Logic-based inferencing               Parser producing logical meaning representations, inferencing engine,
                                      resources to extract meaning postulates and common sense knowledge from.
Vector space semantic models          Large monolingual corpus, possibly parser.
Surface string similarity measures    Only preprocessing tools, e.g., POS tagger, named-entity recognizer,
                                      which are also required by most other methods.
Syntactic similarity measures         Parser.
Similarity measures operating on      Lexical semantic resources, possibly parser and/or semantic role
symbolic meaning representations      labeling to produce semantic representations.
Machine learning algorithms           Training/testing datasets, components/resources needed to compute features.
Decoding (transformation sequences)   Synonyms, hypernyms-hyponyms, paraphrasing/TE rules.
Word/sentence alignment               Large parallel or comparable corpora (monolingual or multilingual),
                                      possibly parser.
Pivot language(s)                     Multilingual parallel corpora.
Bootstrapping                         Large monolingual corpus, recognizer.
Distributional hypothesis             Monolingual corpus (possibly parallel or comparable).
Synchronous grammar rules             Monolingual parallel corpus.

Table 4: Main ideas discussed and main resources they typically require. 

Appendix A. On-line Resources Mentioned 

A.1 Bibliographic Resources, Portals, Tutorials 

ACL 2007 tutorial on textual entailment: 
http://www.cs.biu.ac.il/~dagan/TE-Tutorial-ACL07.ppt. 

ACL Anthology: http://www.aclweb.org/anthology/. 

Textual Entailment Portal: 
http://www.aclweb.org/aclwiki/index.php?title=Textual_Entailment_Portal. 



A.2 Corpora, Challenges, and their Datasets 

Cohn et al.'s paraphrase corpus: Word-aligned paraphrases; 
http://www.dcs.shef.ac.uk/~tcohn/paraphrase_corpus.html. 

FATE: The RTE-2 dataset with FrameNet annotations; 
http://www.coli.uni-saarland.de/projects/salsa/fate/. 

MSR Paraphrase Corpus: Paraphrase recognition benchmark dataset; 
http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/. 

Multiple-Translation Chinese Corpus: Multiple English translations of Chinese news articles; 
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002T01. 

RTE challenges, PASCAL Network of Excellence: Textual entailment recognition challenges and 
their datasets; http://pascallin.ecs.soton.ac.uk/Challenges/. 

RTE track of NIST's Text Analysis Conference: Continuation of PASCAL's RTE; 
http://www.nist.gov/tac/tracks/. 

Written News Compression Corpus: Sentence compression corpus; 
http://jamesclarke.net/research/. 

A.3 Implementations of Machine Learning Algorithms 

LIBSVM: SVM implementation; http://www.csie.ntu.edu.tw/~cjlin/libsvm/. 

Stanford's Maximum Entropy classifier: http://nlp.stanford.edu/software/index.shtml. 

SVM-Light: SVM implementation; http://svmlight.joachims.org/. 

Weka: Includes implementations of many machine learning algorithms; 
http://www.cs.waikato.ac.nz/ml/weka/. 

A.4 Implementations of Similarity Measures 

EDITS: Suite to recognize textual entailment by computing edit distances; http://edits.fbk.eu/. 

WordNet::Similarity: Implementations of WordNet-based similarity measures; 
http://wn-similarity.sourceforge.net/. 

A.5 Parsers, POS Taggers, Named Entity Recognizers, Stemmers 

Brill's POS tagger: http://en.wikipedia.org/wiki/Brill_tagger. 

Charniak's parser: http://flake.cs.uiuc.edu/~cogcomp/srl/CharniakServer.tgz. 

Collins' parser: http://people.csail.mit.edu/mcollins/code.html. 

Link Grammar Parser: http://www.abisource.com/projects/link-grammar/. 

MaltParser: http://w3.msi.vxu.se/~nivre/research/MaltParser.html. 

MINIPAR: http://www.cs.ualberta.ca/~lindek/minipar.htm. 

Porter's stemmer: http://tartarus.org/~martin/PorterStemmer/. 

Stanford's named-entity recognizer, parser, tagger: http://nlp.stanford.edu/software/index.shtml. 

A.6 Statistical Machine Translation Tools and Resources 

Giza++: Often used to train IBM models and align words; http://www.fjoch.com/GIZA++.html. 

Koehn's Statistical Machine Translation site: Pointers to commonly used SMT tools, resources; 
http://www.statmt.org/. 

Moses: Frequently used SMT system that includes decoding facilities; 
http://www.statmt.org/moses/. 

SRILM: Commonly used to create language models; 
http://www.speech.sri.com/projects/srilm/. 

A.7 Lexical Resources, Paraphrasing and Textual Entailment Rules 

Callison-Burch's paraphrase rules: Paraphrase rules extracted from multilingual parallel corpora 
via pivot language(s); the implementation of the method used is also available; 
http://cs.jhu.edu/~ccb/. 

DIRT rules: Template pairs produced by DIRT; http://demo.patrickpantel.com/. 

Extended WordNet: Includes meaning representations extracted from WordNet's glosses; 
http://xwn.hlt.utdallas.edu/. 

FrameNet: http://framenet.icsi.berkeley.edu/. 

Nomlex: English nominalizations of verbs; http://nlp.cs.nyu.edu/nomlex/. 

TEASE rules: Textual entailment rules produced by TEASE; 
http://www.cs.biu.ac.il/~szpekti/TEASE_collection.zip. 

VerbNet: http://verbs.colorado.edu/~mpalmer/projects/verbnet.html. 

WordNet: http://wordnet.princeton.edu/. 

Zhao et al.'s paraphrase rules: Paraphrase rules with slots corresponding to POS tags, extracted 
from multilingual parallel corpora via pivot language(s); 
http://ir.hit.edu.cn/phpwebsite/index.php?module=documents&JAS_DocumentManager_op=viewDocument&JAS_Document_id=268. 